DELE CA2 PART B (Reinforcement Learning)¶


Done by: Kent Chua Yi Jie (P2415675) & Goh Yu Jie (P2415901)

Class: DAAA/FT/1B/04


Tasks for RL:¶

  • Apply a suitable modification of deep Q-network (DQN) architecture to the problem.

  • Your model should exert some appropriate torque on the pendulum to balance it.

  • You may consider other reinforcement learning architectures, if you wish, but only after

successfully implementing DQN. Otherwise, any other non-DQN architecture will be rejected.

  • In your work, you should plan clearly what you are doing, your approaches, and how

you systematically optimise your solutions.

- For example, what hyperparameters can you tune? Is one trial enough or should you repeat the trials? Why?

- How do you conclusively demonstrate your so-called “best setup” to be the best? Are you considering fastest learning, most stable learning, or some other criteria that you choose to define?

In [5]:
import gym
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Model, optimizers
from tensorflow.keras.optimizers import Adam
import numpy as np
import random
from collections import deque
import os
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
import imageio
from IPython.display import Image, display
import random
import time
from collections import deque
from tensorflow.keras.models import Sequential
import json
from tensorflow import keras
from tensorflow.keras import layers
import seaborn as sns
from scipy import stats
import pandas as pd

Environment Exploration (EDA)¶

Objectives of EDA in RL:

  • Understand the observation space (what the agent sees)

  • Understand the action space (how the agent can interact)

  • Understand the reward structure (how success is measured)

  • Visualize the environment's behavior

In [2]:
# Create the environment (using specified version)
env = gym.make('Pendulum-v0')  

Selected v0 since in the brief it states

  • "NOTE: Stay within the older version of gym 0.17.3, as implemented in the lab for cartpole."
In [2]:
import gym
print(f"Gym version: {gym.__version__}")
env = gym.make('Pendulum-v0')
print(f"Environment spec: {env.spec}")
Gym version: 0.17.3
Environment spec: EnvSpec(Pendulum-v0)

1) Observation Space Analysis¶


What is observation space?

  • In reinforcement learning, the "observation space" defines the set of all possible states that an agent can perceive from its environment.

  • We can think of it as the information the agent receives at each timestep to make a decision (i.e., choose an action). The observation space can be a single value, a vector, a matrix (like an image), or even a more complex data structure.

Investigating shape and sample¶

In [4]:
print("Shape:", env.observation_space.shape)
Shape: (3,)
In [5]:
print("Sample observation:", env.observation_space.sample())
Sample observation: [ 0.3045093  -0.30357552 -3.5209098 ]
  • The shape of 3 tell us that the environment is a vector of 3 numbers

  • At every timestep, your agent will receive a list of three values.

High and low bounds¶

In [6]:
print("High bounds:", env.observation_space.high)
print("Low bounds:", env.observation_space.low)
High bounds: [1. 1. 8.]
Low bounds: [-1. -1. -8.]
  • The above information tell us the range of values for each of the three numbers in your observation vector.

First Value and Second Value (cos(θ) and sin(θ))

  • The angle of the pendulum (θ) is measured from the upright position. However, instead of providing the angle directly, the environment provides its sine and cosine values.

  • This is a common practice because it provides a continuous representation of the angle, avoiding the discontinuity that would occur if the angle was represented as a single value from, say, −π to π.

  • Both cos(θ) and sin(θ) are always between -1 and 1, which matches your output

Third Value (Angular Velocity)

  • This is how fast the pendulum is swinging and in what direction.

  • A positive value means it's swinging one way, and a negative value means it's swinging the other.

  • The environment has a maximum angular velocity of 8 radians/second, which is why our high bound is 8 our your low bound is -8..

Why do we use sin or cos

  • Using a single number to represent the pendulum's angle, like from −π to π, creates a big problem for a neural network. When the pendulum swings past the bottom, its angle value suddenly jumps from a positive number close to π to a negative number close to −π, even though the physical change is tiny. This sudden, non-smooth jump in the data confuses the neural network and makes it difficult for the model to learn a stable and effective control strategy.

  • By using both the sine and cosine of the angle instead, the environment provides two values that together form a continuous and smooth representation of the pendulum's position on a circle, which is much easier for the neural network to learn from and makes the training process more reliable.


2) Action Space analysis¶


What is Action Space

  • The "action space" defines the set of all possible actions that the reinforcement learning agent can take in the environment.

  • In the context of this project, an action is the torque I apply to the pendulum to try and balance it

Finding out the type¶

In [7]:
print("Type:", env.action_space)
Type: Box(-2.0, 2.0, (1,), float32)

What I can observe (box space)

  • A Box space represents a continuous range of values. The numbers inside the parentheses provide more detail:

    • -2.0 and 2.0: These are the minimum and maximum possible values for the action.

    • (1,): This is the shape of the action vector, indicating it's a single-dimensional vector.

    • float32: This is the data type of the action values.

Finding the shape¶

In [8]:
print("Shape:", env.action_space.shape)
Shape: (1,)
  • This confirms that the action is a single number. At each timestep, our agent must choose one value to apply as torque.

High and low bounds¶

In [9]:
print("High bound:", env.action_space.high)
High bound: [2.]
In [10]:
print("Low bound:", env.action_space.low)
Low bound: [-2.]

Observations:

  • The torque applied to the pendulum can be any real number between -2.0 and 2.0, inclusive.

    • A value of 2.0 represents the maximum positive torque (pushing the pendulum in one direction).

    • A value of -2.0 represents the maximum negative torque (pushing the pendulum in the opposite direction).

    • A value of 0.0 represents no torque being applied.

In [11]:
print("Sample action:", env.action_space.sample())
Sample action: [1.9702554]

3) Reward Structure¶


The reward function is defined as:

r = -(theta2 + 0.1 * theta_dt2 + 0.001 * torque2)

where theta is the pendulum’s angle normalized between [-pi, pi] (with 0 being in the upright position). Based on the above equation, the minimum reward that can be obtained is -(pi2 + 0.1 * 82 + 0.001 * 22) = -16.2736044, while the maximum reward is zero (pendulum is upright with zero velocity and no torque applied).

What does this actually mean

  • The negative sign at the beginning means this is essentially a penalty system. The agent's goal is to get a score as close to zero as possible.

  • Each term inside the parentheses represents a different aspect of the pendulum's state or the agent's action that incurs a penalty.

      1. Penalty for Angle: θ^2

        • θ is the pendulum's angle, where 0 is the upright position. This term penalizes the agent for how far the pendulum is from being perfectly upright.

        • Ex.If the pendulum is far from the top (e.g., hanging down), θ^2 will be a large positive number, resulting in a large penalty. If the pendulum is perfectly upright, θ is 0, and this term becomes 0, giving no penalty.

      1. Penalty for Angular Velocity: 0.1 x θ(dot)^2

        • (theta-dot) is the angular velocity, or how fast the pendulum is swinging. The coefficient of 0.1 means this penalty is 10 times less important than the angle penalty.

        • Ex. Even if the pendulum is at the upright position (θ=0), if it's swinging wildly (θ(dot) is large) the agent will still receive a penalty. The agent learns that it's not enough to be upright; it must also be still.

      1. Penalty for Torque: 0.001⋅τ^2

        • τ (tau) is the torque, which is the action your agent takes. The coefficient of 0.001 makes this a very small penalty.

        • This term encourages the agent to be efficient. It's a small penalty for using torque. The agent is discouraged from frantically applying maximum torque back and forth. Instead, it's incentivized to find a low-effort, stable solution that doesn't require a lot of energy.


4) Visualing the environment¶


In [12]:
import gym
import imageio
from IPython.display import Image, display

def record_and_log_pendulum_v0(gif_path='pendulum_v0_EDA.gif', max_steps=200):
    env = gym.make('Pendulum-v0')
    obs = env.reset()

    frames = []
    total_reward = 0
    step_log = []

    for step in range(max_steps):
        frame = env.render(mode='rgb_array')  # Get the frame
        frames.append(frame)

        action = env.action_space.sample()  # Random action from [-2, 2]
        obs, reward, done, info = env.step(action)  # Step with the action

        total_reward += reward  # Accumulate reward

        # log relevant parts of the observation (cosθ, sinθ, and angular velocity)
        cos_theta, sin_theta, theta_dot = obs

        step_log.append({
            "Step": step,
            "Action": float(action),
            "Reward": float(reward),
            "cos(θ)": cos_theta,
            "sin(θ)": sin_theta,
            "θ_dot": theta_dot
        })

        # Pendulum doesn't have early termination, but just in case
        if done:
            break

    env.close()

    # Save GIF
    imageio.mimsave(gif_path, frames, fps=30)
    display(Image(filename=gif_path))

    # Show reward and step logs
    print(f"\nEpisode finished after {step+1} steps")
    print(f"Total reward: {total_reward:.2f}\n")

    print("Step-by-step log (first 10 steps):")
    for log in step_log[:10]:
        print(log)
In [13]:
record_and_log_pendulum_v0()
<IPython.core.display.Image object>
Episode finished after 200 steps
Total reward: -1298.13

Step-by-step log (first 10 steps):
{'Step': 0, 'Action': -1.3534175157546997, 'Reward': -1.601543104867841, 'cos(θ)': 0.28254386725064784, 'sin(θ)': -0.9592543787124708, 'θ_dot': -0.5174504784979528}
{'Step': 1, 'Action': -0.31267061829566956, 'Reward': -1.676431596319763, 'cos(θ)': 0.22043011119490774, 'sin(θ)': -0.9754027712071566, 'θ_dot': -1.2837918552766563}
{'Step': 2, 'Action': -1.8988834619522095, 'Reward': -1.9869805260381432, 'cos(θ)': 0.10704111296591183, 'sin(θ)': -0.9942545952295211, 'θ_dot': -2.3001764529748554}
{'Step': 3, 'Action': 0.9905000329017639, 'Reward': -2.6720401131251887, 'cos(θ)': -0.03760916461093903, 'sin(θ)': -0.9992925251082724, 'θ_dot': -2.897292394461732}
{'Step': 4, 'Action': 0.5048401951789856, 'Reward': -3.4266819489379587, 'cos(θ)': -0.2144901727381946, 'sin(θ)': -0.9767261467774575, 'θ_dot': -3.5710357590160884}
{'Step': 5, 'Action': -1.8816505670547485, 'Reward': -4.472018273062022, 'cos(θ)': -0.43087413836177046, 'sin(θ)': -0.902412032771617, 'θ_dot': -4.585827954157394}
{'Step': 6, 'Action': 0.22481794655323029, 'Reward': -6.168326848533943, 'cos(θ)': -0.6494849951115621, 'sin(θ)': -0.7603744085152617, 'θ_dot': -5.228914286753122}
{'Step': 7, 'Action': 1.2767332792282104, 'Reward': -7.92371666439053, 'cos(θ)': -0.8345366173547326, 'sin(θ)': -0.5509524791614251, 'θ_dot': -5.607685101255337}
{'Step': 8, 'Action': -0.5608370304107666, 'Reward': -9.6887395082016, 'cos(θ)': -0.9615365218091017, 'sin(θ)': -0.2746771144949192, 'θ_dot': -6.10502501518802}
{'Step': 9, 'Action': -1.30135977268219, 'Reward': -11.927535486436033, 'cos(θ)': -0.9988929786900832, 'sin(θ)': 0.04704059017118103, 'θ_dot': -6.506236816961538}
Stage What Happens
Environment created Fresh simulation environment with no actions taken yet
Reset Pendulum is put at a random angle, first state is given
Loop starts We run the environment for max_steps = 200
Random action A random torque is applied to swing/spin the pendulum
Observation We get new state: cos(θ), sin(θ), angular velocity
Reward Calculated based on how upright and still the pendulum is
Frame render Frame is saved for the GIF
Repeat The new state becomes the current state, and the process repeats
Track reward Each reward adds up — total tells us how “well” the pendulum did overall
In [14]:
def make_constant_action_gif(action_torque, gif_path, env_name='Pendulum-v0', max_steps=200):
    env = gym.make(env_name)
    state = env.reset()
    frames = []

    for _ in range(max_steps):
        frame = env.render(mode='rgb_array')
        frames.append(frame)
        action = np.array([action_torque])  # Pendulum expects a 1D array
        next_state, _, _, _ = env.step(action)
        state = next_state
    env.close()
    imageio.mimsave(gif_path, frames, duration=0.04)
    print(f"GIF saved to {gif_path}")

if __name__ == "__main__":
    for torque in [-2.0, -1.0, 1.0, 2.0]:
        fname = f"pendulum_v0_action_{torque:+.0f}.gif"
        make_constant_action_gif(torque, fname)
GIF saved to pendulum_v0_action_-2.gif
GIF saved to pendulum_v0_action_-1.gif
GIF saved to pendulum_v0_action_+1.gif
GIF saved to pendulum_v0_action_+2.gif

5) Observation Value Distribution¶


Why is this important

  1. Neural Network Input Scaling:
    • Knowing the range and typical values of observations helps decide whether I need to normalize or standardize inputs to my network, which can speed up and stabilize learning.

  1. State Coverage:
    • I see which states are common under random actions, and whether the state space is fully explored or concentrated in certain regions.

  1. Feature Importance:
    • If some observation dimensions (e.g., θ_dot) rarely change, they may be less important, or the network can focus more on others.
In [15]:
n_episodes = 20
max_steps = 200

obs_buffer = []

for ep in range(n_episodes):
    obs = env.reset()
    obs = obs if isinstance(obs, np.ndarray) else obs[0]
    for _ in range(max_steps):
        obs_buffer.append(obs)
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        obs = obs if isinstance(obs, np.ndarray) else obs[0]
        if done:
            break

env.close()
obs_buffer = np.array(obs_buffer)
plt.figure(figsize=(12, 4))
plt.subplot(1,3,1)
plt.hist(obs_buffer[:,0], bins=50)
plt.title('cos(θ) distribution')
plt.subplot(1,3,2)
plt.hist(obs_buffer[:,1], bins=50)
plt.title('sin(θ) distribution')
plt.subplot(1,3,3)
plt.hist(obs_buffer[:,2], bins=50)
plt.title('θ_dot distribution')
plt.tight_layout()
plt.show()
No description has been provided for this image

Observation

  1. cos(θ) Distribution
  • Skewed towards -1
  • In physical terms, the pendulum is hanging down most of the time.
  • It's hard for random actions to keep pendulum upright
  1. sin(θ) Distribution
  • It has a bimodal shape which Peaks at -1 and +1, with values spread across the range.
  • Random actions spin it all over, so wesee values across the range.
  1. θ_dot Distribution
  • We see a normal shape centred at 0
  • Most of the time, the pendulum rotates slowly (angular velocity near zero), but sometimes random actions make it swing faster.
  • Random actions usually don’t keep it spinning fast, so θ_dot stays close to zero most of the time.

6) Reward Distribution¶


It shows the immediate reward I get from each action in each timestep

Why is this important

  1. Baseline Performance:

    • Shows how “bad” a random agent is, so I can compare my RL agent later.
  2. Reward Scaling:

    • Helps me decide if I need to scale or shift rewards for stable DQN learning.
  3. Reward Target:

    • If maximum possible reward is 0, but random actions always get -100 to -1000, I will then know what realistic goals are for improvement.
In [16]:
rewards = []
for _ in range(20):
    obs = env.reset()
    obs = obs if isinstance(obs, np.ndarray) else obs[0]
    for _ in range(200):
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        obs = obs if isinstance(obs, np.ndarray) else obs[0]
        rewards.append(reward)
        if done: break
env.close()
plt.hist(rewards, bins=50)
plt.title('Reward Distribution (Random Policy)')
plt.xlabel('Reward')
plt.ylabel('Count')
plt.show()
print("Reward min/max:", min(rewards), max(rewards))
No description has been provided for this image
Reward min/max: -16.180766869664986 -0.00179394192202261

Observations

  • Most rewards are between -11 and 0, with some more negative values.
  • They are mostly negative which indicates that the random actions do a poor job of balancing the pendulum

7) Episode Return Distribution¶


This is the total reward across an entire episode (sum of rewards for all steps in one episode).

Why is this important

  1. It shows whether the agent is learning to consistently improve episode return
In [17]:
episode_returns = []
for _ in range(20):
    obs = env.reset()
    obs = obs if isinstance(obs, np.ndarray) else obs[0]
    total = 0
    for _ in range(200):
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        obs = obs if isinstance(obs, np.ndarray) else obs[0]
        total += reward
        if done: break
    episode_returns.append(total)
env.close()

plt.hist(episode_returns, bins=20)
plt.title('Episode Returns (Random Policy)')
plt.xlabel('Total Reward')
plt.ylabel('Count')
plt.show()
print("Mean random policy return:", np.mean(episode_returns))
No description has been provided for this image
Mean random policy return: -1390.7370857671726

Observations

  • Most episodes score between -1600 and -800, with a mean around -1249.

Questions to better understand ¶


1) What are the 2 action spaces?¶

  • Discrete Action Space –> Only certain fixed actions are allowed.

    • Example: In CartPole, you can move left or right.

    • Represented like: action = 0 or action = 1.

  • Continuous Action Space –> Actions can take any value within a range (including decimals).

    • Example: In Pendulum-v0, the agent can apply a torque between -2.0 to 2.0.

    • So action = -1.47, 0.32, 2.00, etc. — any float value within that range.


2) Why DQN does not work on continuous action spaces¶

  • DQN (Deep Q-Network) is designed to assign values (Q-values) to discrete actions

  • It will work great when there a few fixed actions to choose from

  • HOWEVER if continuous, it will be impossible for DQN to evaluate every single possible float value from -2.0 to 2.0


3) What is Discretisation?¶

  • It means manually converting the continuous range into a set of fixed, allowed values

  • Instead of sampling any float, agent will choose from a list of actions that are finite


DEEP Q-Network¶


A Deep Q-Network (DQN) is a powerful and foundational algorithm in the field of reinforcement learning (RL). It's an extension of traditional Q-learning that uses deep neural networks to handle complex environments, like the Pendulum environment

We will conduct a systematic exploration of action space discretization to establish an optimized baseline configuration before evaluating algorithmic improvements

Cherry-Picking vs Optimizing

  • Cherry-picking: Running experiments, seeing results, then changing methodology to get better results

  • Optimizing: Systematically exploring configuration space, then using best configuration for all subsequent experiments


Step 1: Baseline DQN Implementation¶

Why do we need a baseline It is for

  1. Reference Point:

    • The baseline is our starting point—a simple, standard implementation I compare for all future improvements to.
  2. Debugging:

    • Ensures my code and setup work as expected before adding complexity.
  3. Performance Benchmark:

    • Shows what “basic DQN” can achieve so I can measure the impact of changes (like more actions, network size, or hyperparameters).

How do I determine best model?

  1. Mean Reward
  • This is the primary metric. The mean reward (or average return) over a set of evaluation episodes tells the agent's average performance with a greedy policy (no exploration)
  1. Reward Variance
  • This metric measures the consistency of the agent's performance
  1. Training vs. Evaluation Comparison
  • Comparison between training and evaluation performance. A significant difference can reveal if the agent has overfitted to a specific training environment or if its learned policy performs better without the noise of exploration.

Running the practical codes to have a general understanding¶

Small changes and improvements made ¶

  1. if not _ % ShowEvery:
    • NameError: name '_' is not defined
    • Instant crash when first render condition is checked
    • I added 'episode_num' parameter

  1. Input Shape Tuple Bug (Input(shape = (self.InputShape)))
    • Input(shape = (self.InputShape,))
    • Should add comma to create a proper tuple

  1. INFINITE LOOP CRASHES (Would Hang System) (while not Done)
    • Episodes run for thousands of steps, consuming all memor
    • Added 'step_count >= 200' then terminate

  1. LOGIC BUGS ActualTorque = (A / NActions - 0.5) * 4
    • The code above would only provide a range (-2,1.9)
    • A can be 0 to 39, so A/40 = 0 to 0.975.This gives range: (-2, 1.9) instead of (-2, 2)
    • ActualTorque = (A / (NActions - 1) - 0.5) * 4 WILL ALLOW for (-2,2)

  1. No Bounds Checking (ActualA = round((A + 2) * (NActions - 1) / 4))
    • Could be -1 or 40
    • ActualA = max(0, min(NActions - 1, ActualA)) to ensure always 0 to 39
In [28]:
class DQN:
    def __init__(self,
                 InputShape = 4,
                 NActions = 2,
                 Gamma = 1,
                 ReplayMemorySize = 10000,
                 MinReplayMemory = 1000,
                 UpdateTargetEveryThisEpisodes = 1,
                 IntermediateSize = 64,
                 BatchSize = 32):
        
        # Hyperparameters. #
        
        self.InputShape = InputShape
        self.NActions = NActions
        self.Gamma = Gamma
        self.ReplayMemorySize = ReplayMemorySize
        self.MinReplayMemory = MinReplayMemory
        self.UpdateTargetEveryThisEpisodes = UpdateTargetEveryThisEpisodes
        self.IntermediateSize = IntermediateSize
        self.BatchSize = BatchSize
        
        # Main model. #
        
        self.Main = self.CreateModel('Main')
        self.Optimiser = Adam()
        
        # Target model. #
        
        self.Target = self.CreateModel('Target')
        self.Target.set_weights(self.Main.get_weights())
        
        # Replay memory. #
        
        self.ReplayMemory = deque(maxlen = ReplayMemorySize)
        
        # Target network update counter. #
        
        self.TargetUpdateCounter = 0
    
    def CreateModel(self, Type):
        inputs = Input(shape = (self.InputShape,), name = 'Input')  # Fixed: Added comma
        x = Dense(self.IntermediateSize, activation = 'relu', name = '1stHiddenLayer')(inputs)
        x = Dense(self.IntermediateSize, activation = 'relu', name = '2ndHiddenLayer')(x)
        outputs = Dense(self.NActions, activation = 'linear', name = 'Output')(x)
        
        NN = Model(inputs, outputs, name = f'{Type}')
        NN.summary()
        
        return NN
    
    def UpdateReplayMemory(self, Information): # Information = (S, A, R, SNext, Done)
        self.ReplayMemory.append(Information)

    def Train(self, EndOfEpisode):
        
        # Only train if replay memory has enough data. #
        
        if len(self.ReplayMemory) < self.MinReplayMemory:
            print(f'DID NOT TRAIN..., replay memory = {len(self.ReplayMemory)}')
            return
        
        # Get batch of data for training. #
        
        TrainingData = random.sample(self.ReplayMemory, self.BatchSize)
        
        # Get states from training data, then get corresponding Q values. #
        
        ListOfS = np.array([element[0] for element in TrainingData])
        ListOfQ = np.array(self.Main(ListOfS))
        
        # Get future states from training data, then get corresponding Q values. #
        
        ListOfSNext = np.array([element[3] for element in TrainingData])
        ListOfQNext = self.Target(ListOfSNext)
        
        # Build actual training data for neural network. #
        
        X = []
        Y = []
        for index, (S, A, R, SNext, Done) in enumerate(TrainingData):
            if not Done:
                MaxQNext = np.max(ListOfQNext[index])
                QNext = R + self.Gamma * MaxQNext
            else:
                QNext = R
            Q = ListOfQ[index]
            Q[A] = QNext
        
            X.append(S)
            Y.append(Q)
        
        # Train model using tf.GradientTape(), defined below.
    
        self.GTfit(X, Y)
                
        # Update target network every episode. #
        
        if EndOfEpisode:
            self.TargetUpdateCounter += 1
        
        # Update target if counter is full. #
        
        if self.TargetUpdateCounter >= self.UpdateTargetEveryThisEpisodes:
            self.Target.set_weights(self.Main.get_weights())
            self.TargetUpdateCounter = 0
    
    # This is the tf.GradientTape() which significantly speeds up training of neural networks.
    
    @tf.function
    def GTfit(self, X, Y):
        
        # Train the neural network with this batch of data. #
        
        with tf.GradientTape() as tape:
            Predictions = self.Main(tf.convert_to_tensor(X), training = True)
            Loss = tf.math.reduce_mean(tf.math.square(tf.convert_to_tensor(Y) - Predictions))
        Grad = tape.gradient(Loss, self.Main.trainable_variables)
        self.Optimiser.apply_gradients(zip(Grad, self.Main.trainable_variables))
In [29]:
#Fixed hyperparameters and added missing variables
EnvName = 'Pendulum-v0'
IntermediateSize = 64
Epsilon = 0.1
ShowEvery = 10
InputShape = 3
NActions = 40
In [30]:
# Fixed action conversion functions
def PendulumActionConverter(A, NActions=NActions):
    ActualTorque = (A / (NActions - 1) - 0.5) * 4  # Fixed division
    return ActualTorque

def PendulumInverseActionConverter(A, NActions=NActions):
    ActualA = round((A + 2) * (NActions - 1) / 4)
    ActualA = max(0, min(NActions - 1, ActualA))  # Added bounds checking
    return ActualA
In [5]:
# Fixed training episode function
def OneEpisode(episode_num, epsilon=None, render=False):  # ADDED: epsilon parameter for evaluation
    if epsilon is None:
        epsilon = Epsilon  # Use global Epsilon for training
        
    env = gym.make(f'{EnvName}')
    S = env.reset()  # Pendulum-v0 returns only observation
    ListOfRewards = []
    Done = False
    step_count = 0
    
    while not Done:
        Q = DQN.Main(S.reshape(-1, S.shape[0]))
        if np.random.rand() < epsilon:  # CHANGED: Use parameter epsilon
            AStep = env.action_space.sample()
            A = PendulumInverseActionConverter(AStep[0])
        else:
            A = np.argmax(Q)
            AStep_torque = PendulumActionConverter(A)
            AStep = np.array([AStep_torque])
            A = PendulumInverseActionConverter(AStep_torque)
            
        # Fixed rendering condition
        if render and episode_num % ShowEvery == 0:  # CHANGED: Use render parameter
            env.render()
            
        SNext, R, Done, Info = env.step(AStep)
        
        # Only update replay memory during training (when epsilon > 0)
        if epsilon > 0:  # ADDED: Only train during training episodes
            DQN.UpdateReplayMemory((S, A, R, SNext, Done))
            DQN.Train(Done)
            
        ListOfRewards.append(R)
        step_count += 1
        
        # Added max step limit
        if step_count >= 200:
            Done = True
            
        if Done:
            total_reward = np.sum(ListOfRewards)
            if epsilon == 0:  # Evaluation episode
                print(f'Evaluation Episode {episode_num} finished! Return: {total_reward:.2f}')
            else:  # Training episode
                print(f'Training Episode {episode_num} finished! Return: {total_reward:.2f}')
            env.close()
            return total_reward
        S = SNext
In [6]:
# NEW: Evaluation function (epsilon=0, no training)
def EvaluateAgent(num_episodes=10, render=False):
    """
    Evaluate the trained agent with epsilon=0 (no exploration).
    This tests the learned policy without any random actions.
    """
    print(f"\n=== EVALUATION PHASE (Epsilon=0, No Training) ===")
    print(f"Running {num_episodes} evaluation episodes...")
    
    evaluation_rewards = []
    
    for episode in range(num_episodes):
        # Run episode with epsilon=0 (no exploration) and no training
        reward = OneEpisode(episode + 1, epsilon=0.0, render=render)
        evaluation_rewards.append(reward)
    
    # Calculate evaluation statistics
    mean_reward = np.mean(evaluation_rewards)
    std_reward = np.std(evaluation_rewards)
    min_reward = np.min(evaluation_rewards)
    max_reward = np.max(evaluation_rewards)
    
    print(f"\n=== EVALUATION RESULTS ===")
    print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")
    print(f"Min Reward:  {min_reward:.2f}")
    print(f"Max Reward:  {max_reward:.2f}")
    print(f"All Rewards: {[f'{r:.1f}' for r in evaluation_rewards]}")
    
    return evaluation_rewards, mean_reward
In [25]:
# MAIN TRAINING AND EVALUATION LOOP
if __name__ == "__main__":
    import time
    STARTTIME = time.time()

    # Create DQN with better gamma value
    DQN = DQN(InputShape = InputShape, NActions = NActions, Gamma = 0.99)
    
    training_rewards = []  # Track training rewards
    
    print("=== TRAINING PHASE ===")
    for episode in range(150):
        print(f'Training Episode {episode + 1}')
        reward = OneEpisode(episode + 1, epsilon=Epsilon, render=False)  # Training with exploration
        training_rewards.append(reward)
        
        # Print running average every 10 episodes
        if (episode + 1) % 10 == 0:
            avg_reward = np.mean(training_rewards[-10:])
            print(f'Average training reward over last 10 episodes: {avg_reward:.2f}')

    training_time = time.time() - STARTTIME
    print(f'Training completed in: {training_time:.2f} seconds')
    
    # EVALUATION PHASE
    print(f'\nFinal training reward over last 10 episodes: {np.mean(training_rewards[-10:]):.2f}')
    
    # Evaluate the trained agent
    eval_rewards, eval_mean = EvaluateAgent(num_episodes=10, render=False)
    
    # COMPARISON
    print(f"\n=== TRAINING vs EVALUATION COMPARISON ===")
    training_final = np.mean(training_rewards[-10:])
    print(f"Final Training Performance (with epsilon={Epsilon}): {training_final:.2f}")
    print(f"Evaluation Performance (with epsilon=0.0):      {eval_mean:.2f}")
    
    improvement = eval_mean - training_final
    if improvement > 0:
        print(f"gent performs {improvement:.2f} points BETTER without exploration!")
    else:
        print(f"Agent performs {abs(improvement):.2f} points WORSE without exploration")
    
    print(f"\nTotal time (training + evaluation): {time.time() - STARTTIME:.2f} seconds")
Model: "Main"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 Input (InputLayer)          [(None, 3)]               0         
                                                                 
 1stHiddenLayer (Dense)      (None, 64)                256       
                                                                 
 2ndHiddenLayer (Dense)      (None, 64)                4160      
                                                                 
 Output (Dense)              (None, 40)                2600      
                                                                 
=================================================================
Total params: 7016 (27.41 KB)
Trainable params: 7016 (27.41 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Model: "Target"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 Input (InputLayer)          [(None, 3)]               0         
                                                                 
 1stHiddenLayer (Dense)      (None, 64)                256       
                                                                 
 2ndHiddenLayer (Dense)      (None, 64)                4160      
                                                                 
 Output (Dense)              (None, 40)                2600      
                                                                 
=================================================================
Total params: 7016 (27.41 KB)
Trainable params: 7016 (27.41 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
=== TRAINING PHASE ===
Training Episode 1
DID NOT TRAIN..., replay memory = 1
DID NOT TRAIN..., replay memory = 2
DID NOT TRAIN..., replay memory = 3
DID NOT TRAIN..., replay memory = 4
DID NOT TRAIN..., replay memory = 5
DID NOT TRAIN..., replay memory = 6
DID NOT TRAIN..., replay memory = 7
DID NOT TRAIN..., replay memory = 8
DID NOT TRAIN..., replay memory = 9
DID NOT TRAIN..., replay memory = 10
DID NOT TRAIN..., replay memory = 11
DID NOT TRAIN..., replay memory = 12
DID NOT TRAIN..., replay memory = 13
DID NOT TRAIN..., replay memory = 14
DID NOT TRAIN..., replay memory = 15
DID NOT TRAIN..., replay memory = 16
DID NOT TRAIN..., replay memory = 17
DID NOT TRAIN..., replay memory = 18
DID NOT TRAIN..., replay memory = 19
DID NOT TRAIN..., replay memory = 20
DID NOT TRAIN..., replay memory = 21
DID NOT TRAIN..., replay memory = 22
DID NOT TRAIN..., replay memory = 23
DID NOT TRAIN..., replay memory = 24
DID NOT TRAIN..., replay memory = 25
DID NOT TRAIN..., replay memory = 26
DID NOT TRAIN..., replay memory = 27
DID NOT TRAIN..., replay memory = 28
DID NOT TRAIN..., replay memory = 29
DID NOT TRAIN..., replay memory = 30
DID NOT TRAIN..., replay memory = 31
DID NOT TRAIN..., replay memory = 32
DID NOT TRAIN..., replay memory = 33
DID NOT TRAIN..., replay memory = 34
DID NOT TRAIN..., replay memory = 35
DID NOT TRAIN..., replay memory = 36
DID NOT TRAIN..., replay memory = 37
DID NOT TRAIN..., replay memory = 38
DID NOT TRAIN..., replay memory = 39
DID NOT TRAIN..., replay memory = 40
DID NOT TRAIN..., replay memory = 41
DID NOT TRAIN..., replay memory = 42
DID NOT TRAIN..., replay memory = 43
DID NOT TRAIN..., replay memory = 44
DID NOT TRAIN..., replay memory = 45
DID NOT TRAIN..., replay memory = 46
DID NOT TRAIN..., replay memory = 47
DID NOT TRAIN..., replay memory = 48
DID NOT TRAIN..., replay memory = 49
DID NOT TRAIN..., replay memory = 50
DID NOT TRAIN..., replay memory = 51
DID NOT TRAIN..., replay memory = 52
DID NOT TRAIN..., replay memory = 53
DID NOT TRAIN..., replay memory = 54
DID NOT TRAIN..., replay memory = 55
DID NOT TRAIN..., replay memory = 56
DID NOT TRAIN..., replay memory = 57
DID NOT TRAIN..., replay memory = 58
DID NOT TRAIN..., replay memory = 59
DID NOT TRAIN..., replay memory = 60
DID NOT TRAIN..., replay memory = 61
DID NOT TRAIN..., replay memory = 62
DID NOT TRAIN..., replay memory = 63
DID NOT TRAIN..., replay memory = 64
DID NOT TRAIN..., replay memory = 65
DID NOT TRAIN..., replay memory = 66
DID NOT TRAIN..., replay memory = 67
DID NOT TRAIN..., replay memory = 68
DID NOT TRAIN..., replay memory = 69
DID NOT TRAIN..., replay memory = 70
DID NOT TRAIN..., replay memory = 71
DID NOT TRAIN..., replay memory = 72
DID NOT TRAIN..., replay memory = 73
DID NOT TRAIN..., replay memory = 74
DID NOT TRAIN..., replay memory = 75
DID NOT TRAIN..., replay memory = 76
DID NOT TRAIN..., replay memory = 77
DID NOT TRAIN..., replay memory = 78
DID NOT TRAIN..., replay memory = 79
DID NOT TRAIN..., replay memory = 80
DID NOT TRAIN..., replay memory = 81
DID NOT TRAIN..., replay memory = 82
DID NOT TRAIN..., replay memory = 83
DID NOT TRAIN..., replay memory = 84
DID NOT TRAIN..., replay memory = 85
DID NOT TRAIN..., replay memory = 86
DID NOT TRAIN..., replay memory = 87
DID NOT TRAIN..., replay memory = 88
DID NOT TRAIN..., replay memory = 89
DID NOT TRAIN..., replay memory = 90
DID NOT TRAIN..., replay memory = 91
DID NOT TRAIN..., replay memory = 92
DID NOT TRAIN..., replay memory = 93
DID NOT TRAIN..., replay memory = 94
DID NOT TRAIN..., replay memory = 95
DID NOT TRAIN..., replay memory = 96
DID NOT TRAIN..., replay memory = 97
DID NOT TRAIN..., replay memory = 98
DID NOT TRAIN..., replay memory = 99
DID NOT TRAIN..., replay memory = 100
DID NOT TRAIN..., replay memory = 101
DID NOT TRAIN..., replay memory = 102
DID NOT TRAIN..., replay memory = 103
DID NOT TRAIN..., replay memory = 104
DID NOT TRAIN..., replay memory = 105
DID NOT TRAIN..., replay memory = 106
DID NOT TRAIN..., replay memory = 107
DID NOT TRAIN..., replay memory = 108
DID NOT TRAIN..., replay memory = 109
DID NOT TRAIN..., replay memory = 110
DID NOT TRAIN..., replay memory = 111
DID NOT TRAIN..., replay memory = 112
DID NOT TRAIN..., replay memory = 113
DID NOT TRAIN..., replay memory = 114
DID NOT TRAIN..., replay memory = 115
DID NOT TRAIN..., replay memory = 116
DID NOT TRAIN..., replay memory = 117
DID NOT TRAIN..., replay memory = 118
DID NOT TRAIN..., replay memory = 119
DID NOT TRAIN..., replay memory = 120
DID NOT TRAIN..., replay memory = 121
DID NOT TRAIN..., replay memory = 122
DID NOT TRAIN..., replay memory = 123
DID NOT TRAIN..., replay memory = 124
DID NOT TRAIN..., replay memory = 125
DID NOT TRAIN..., replay memory = 126
DID NOT TRAIN..., replay memory = 127
DID NOT TRAIN..., replay memory = 128
DID NOT TRAIN..., replay memory = 129
DID NOT TRAIN..., replay memory = 130
DID NOT TRAIN..., replay memory = 131
DID NOT TRAIN..., replay memory = 132
DID NOT TRAIN..., replay memory = 133
DID NOT TRAIN..., replay memory = 134
DID NOT TRAIN..., replay memory = 135
DID NOT TRAIN..., replay memory = 136
DID NOT TRAIN..., replay memory = 137
DID NOT TRAIN..., replay memory = 138
DID NOT TRAIN..., replay memory = 139
DID NOT TRAIN..., replay memory = 140
DID NOT TRAIN..., replay memory = 141
DID NOT TRAIN..., replay memory = 142
DID NOT TRAIN..., replay memory = 143
DID NOT TRAIN..., replay memory = 144
DID NOT TRAIN..., replay memory = 145
DID NOT TRAIN..., replay memory = 146
DID NOT TRAIN..., replay memory = 147
DID NOT TRAIN..., replay memory = 148
DID NOT TRAIN..., replay memory = 149
DID NOT TRAIN..., replay memory = 150
DID NOT TRAIN..., replay memory = 151
DID NOT TRAIN..., replay memory = 152
DID NOT TRAIN..., replay memory = 153
DID NOT TRAIN..., replay memory = 154
DID NOT TRAIN..., replay memory = 155
DID NOT TRAIN..., replay memory = 156
DID NOT TRAIN..., replay memory = 157
DID NOT TRAIN..., replay memory = 158
DID NOT TRAIN..., replay memory = 159
DID NOT TRAIN..., replay memory = 160
DID NOT TRAIN..., replay memory = 161
DID NOT TRAIN..., replay memory = 162
DID NOT TRAIN..., replay memory = 163
DID NOT TRAIN..., replay memory = 164
DID NOT TRAIN..., replay memory = 165
DID NOT TRAIN..., replay memory = 166
DID NOT TRAIN..., replay memory = 167
DID NOT TRAIN..., replay memory = 168
DID NOT TRAIN..., replay memory = 169
DID NOT TRAIN..., replay memory = 170
DID NOT TRAIN..., replay memory = 171
DID NOT TRAIN..., replay memory = 172
DID NOT TRAIN..., replay memory = 173
DID NOT TRAIN..., replay memory = 174
DID NOT TRAIN..., replay memory = 175
DID NOT TRAIN..., replay memory = 176
DID NOT TRAIN..., replay memory = 177
DID NOT TRAIN..., replay memory = 178
DID NOT TRAIN..., replay memory = 179
DID NOT TRAIN..., replay memory = 180
DID NOT TRAIN..., replay memory = 181
DID NOT TRAIN..., replay memory = 182
DID NOT TRAIN..., replay memory = 183
DID NOT TRAIN..., replay memory = 184
DID NOT TRAIN..., replay memory = 185
DID NOT TRAIN..., replay memory = 186
DID NOT TRAIN..., replay memory = 187
DID NOT TRAIN..., replay memory = 188
DID NOT TRAIN..., replay memory = 189
DID NOT TRAIN..., replay memory = 190
DID NOT TRAIN..., replay memory = 191
DID NOT TRAIN..., replay memory = 192
DID NOT TRAIN..., replay memory = 193
DID NOT TRAIN..., replay memory = 194
DID NOT TRAIN..., replay memory = 195
DID NOT TRAIN..., replay memory = 196
DID NOT TRAIN..., replay memory = 197
DID NOT TRAIN..., replay memory = 198
DID NOT TRAIN..., replay memory = 199
DID NOT TRAIN..., replay memory = 200
Training Episode 1 finished! Return: -936.33
Training Episode 2
DID NOT TRAIN..., replay memory = 201
DID NOT TRAIN..., replay memory = 202
DID NOT TRAIN..., replay memory = 203
DID NOT TRAIN..., replay memory = 204
DID NOT TRAIN..., replay memory = 205
DID NOT TRAIN..., replay memory = 206
DID NOT TRAIN..., replay memory = 207
DID NOT TRAIN..., replay memory = 208
DID NOT TRAIN..., replay memory = 209
DID NOT TRAIN..., replay memory = 210
DID NOT TRAIN..., replay memory = 211
DID NOT TRAIN..., replay memory = 212
DID NOT TRAIN..., replay memory = 213
DID NOT TRAIN..., replay memory = 214
DID NOT TRAIN..., replay memory = 215
DID NOT TRAIN..., replay memory = 216
DID NOT TRAIN..., replay memory = 217
DID NOT TRAIN..., replay memory = 218
DID NOT TRAIN..., replay memory = 219
DID NOT TRAIN..., replay memory = 220
DID NOT TRAIN..., replay memory = 221
DID NOT TRAIN..., replay memory = 222
DID NOT TRAIN..., replay memory = 223
DID NOT TRAIN..., replay memory = 224
DID NOT TRAIN..., replay memory = 225
DID NOT TRAIN..., replay memory = 226
DID NOT TRAIN..., replay memory = 227
DID NOT TRAIN..., replay memory = 228
DID NOT TRAIN..., replay memory = 229
DID NOT TRAIN..., replay memory = 230
DID NOT TRAIN..., replay memory = 231
DID NOT TRAIN..., replay memory = 232
DID NOT TRAIN..., replay memory = 233
DID NOT TRAIN..., replay memory = 234
DID NOT TRAIN..., replay memory = 235
DID NOT TRAIN..., replay memory = 236
DID NOT TRAIN..., replay memory = 237
DID NOT TRAIN..., replay memory = 238
DID NOT TRAIN..., replay memory = 239
DID NOT TRAIN..., replay memory = 240
DID NOT TRAIN..., replay memory = 241
DID NOT TRAIN..., replay memory = 242
DID NOT TRAIN..., replay memory = 243
DID NOT TRAIN..., replay memory = 244
DID NOT TRAIN..., replay memory = 245
DID NOT TRAIN..., replay memory = 246
DID NOT TRAIN..., replay memory = 247
DID NOT TRAIN..., replay memory = 248
DID NOT TRAIN..., replay memory = 249
DID NOT TRAIN..., replay memory = 250
DID NOT TRAIN..., replay memory = 251
DID NOT TRAIN..., replay memory = 252
DID NOT TRAIN..., replay memory = 253
DID NOT TRAIN..., replay memory = 254
DID NOT TRAIN..., replay memory = 255
DID NOT TRAIN..., replay memory = 256
DID NOT TRAIN..., replay memory = 257
DID NOT TRAIN..., replay memory = 258
DID NOT TRAIN..., replay memory = 259
DID NOT TRAIN..., replay memory = 260
DID NOT TRAIN..., replay memory = 261
DID NOT TRAIN..., replay memory = 262
DID NOT TRAIN..., replay memory = 263
DID NOT TRAIN..., replay memory = 264
DID NOT TRAIN..., replay memory = 265
DID NOT TRAIN..., replay memory = 266
DID NOT TRAIN..., replay memory = 267
DID NOT TRAIN..., replay memory = 268
DID NOT TRAIN..., replay memory = 269
DID NOT TRAIN..., replay memory = 270
DID NOT TRAIN..., replay memory = 271
DID NOT TRAIN..., replay memory = 272
DID NOT TRAIN..., replay memory = 273
DID NOT TRAIN..., replay memory = 274
DID NOT TRAIN..., replay memory = 275
DID NOT TRAIN..., replay memory = 276
DID NOT TRAIN..., replay memory = 277
DID NOT TRAIN..., replay memory = 278
DID NOT TRAIN..., replay memory = 279
DID NOT TRAIN..., replay memory = 280
DID NOT TRAIN..., replay memory = 281
DID NOT TRAIN..., replay memory = 282
DID NOT TRAIN..., replay memory = 283
DID NOT TRAIN..., replay memory = 284
DID NOT TRAIN..., replay memory = 285
DID NOT TRAIN..., replay memory = 286
DID NOT TRAIN..., replay memory = 287
DID NOT TRAIN..., replay memory = 288
DID NOT TRAIN..., replay memory = 289
DID NOT TRAIN..., replay memory = 290
DID NOT TRAIN..., replay memory = 291
DID NOT TRAIN..., replay memory = 292
DID NOT TRAIN..., replay memory = 293
DID NOT TRAIN..., replay memory = 294
DID NOT TRAIN..., replay memory = 295
DID NOT TRAIN..., replay memory = 296
DID NOT TRAIN..., replay memory = 297
DID NOT TRAIN..., replay memory = 298
DID NOT TRAIN..., replay memory = 299
DID NOT TRAIN..., replay memory = 300
DID NOT TRAIN..., replay memory = 301
DID NOT TRAIN..., replay memory = 302
DID NOT TRAIN..., replay memory = 303
DID NOT TRAIN..., replay memory = 304
DID NOT TRAIN..., replay memory = 305
DID NOT TRAIN..., replay memory = 306
DID NOT TRAIN..., replay memory = 307
DID NOT TRAIN..., replay memory = 308
DID NOT TRAIN..., replay memory = 309
DID NOT TRAIN..., replay memory = 310
DID NOT TRAIN..., replay memory = 311
DID NOT TRAIN..., replay memory = 312
DID NOT TRAIN..., replay memory = 313
DID NOT TRAIN..., replay memory = 314
DID NOT TRAIN..., replay memory = 315
DID NOT TRAIN..., replay memory = 316
DID NOT TRAIN..., replay memory = 317
DID NOT TRAIN..., replay memory = 318
DID NOT TRAIN..., replay memory = 319
DID NOT TRAIN..., replay memory = 320
DID NOT TRAIN..., replay memory = 321
DID NOT TRAIN..., replay memory = 322
DID NOT TRAIN..., replay memory = 323
DID NOT TRAIN..., replay memory = 324
DID NOT TRAIN..., replay memory = 325
DID NOT TRAIN..., replay memory = 326
DID NOT TRAIN..., replay memory = 327
DID NOT TRAIN..., replay memory = 328
DID NOT TRAIN..., replay memory = 329
DID NOT TRAIN..., replay memory = 330
DID NOT TRAIN..., replay memory = 331
DID NOT TRAIN..., replay memory = 332
DID NOT TRAIN..., replay memory = 333
DID NOT TRAIN..., replay memory = 334
DID NOT TRAIN..., replay memory = 335
DID NOT TRAIN..., replay memory = 336
DID NOT TRAIN..., replay memory = 337
DID NOT TRAIN..., replay memory = 338
DID NOT TRAIN..., replay memory = 339
DID NOT TRAIN..., replay memory = 340
DID NOT TRAIN..., replay memory = 341
DID NOT TRAIN..., replay memory = 342
DID NOT TRAIN..., replay memory = 343
DID NOT TRAIN..., replay memory = 344
DID NOT TRAIN..., replay memory = 345
DID NOT TRAIN..., replay memory = 346
DID NOT TRAIN..., replay memory = 347
DID NOT TRAIN..., replay memory = 348
DID NOT TRAIN..., replay memory = 349
DID NOT TRAIN..., replay memory = 350
DID NOT TRAIN..., replay memory = 351
DID NOT TRAIN..., replay memory = 352
DID NOT TRAIN..., replay memory = 353
DID NOT TRAIN..., replay memory = 354
DID NOT TRAIN..., replay memory = 355
DID NOT TRAIN..., replay memory = 356
DID NOT TRAIN..., replay memory = 357
DID NOT TRAIN..., replay memory = 358
DID NOT TRAIN..., replay memory = 359
DID NOT TRAIN..., replay memory = 360
DID NOT TRAIN..., replay memory = 361
DID NOT TRAIN..., replay memory = 362
DID NOT TRAIN..., replay memory = 363
DID NOT TRAIN..., replay memory = 364
DID NOT TRAIN..., replay memory = 365
DID NOT TRAIN..., replay memory = 366
DID NOT TRAIN..., replay memory = 367
DID NOT TRAIN..., replay memory = 368
DID NOT TRAIN..., replay memory = 369
DID NOT TRAIN..., replay memory = 370
DID NOT TRAIN..., replay memory = 371
DID NOT TRAIN..., replay memory = 372
DID NOT TRAIN..., replay memory = 373
DID NOT TRAIN..., replay memory = 374
DID NOT TRAIN..., replay memory = 375
DID NOT TRAIN..., replay memory = 376
DID NOT TRAIN..., replay memory = 377
DID NOT TRAIN..., replay memory = 378
DID NOT TRAIN..., replay memory = 379
DID NOT TRAIN..., replay memory = 380
DID NOT TRAIN..., replay memory = 381
DID NOT TRAIN..., replay memory = 382
DID NOT TRAIN..., replay memory = 383
DID NOT TRAIN..., replay memory = 384
DID NOT TRAIN..., replay memory = 385
DID NOT TRAIN..., replay memory = 386
DID NOT TRAIN..., replay memory = 387
DID NOT TRAIN..., replay memory = 388
DID NOT TRAIN..., replay memory = 389
DID NOT TRAIN..., replay memory = 390
DID NOT TRAIN..., replay memory = 391
DID NOT TRAIN..., replay memory = 392
DID NOT TRAIN..., replay memory = 393
DID NOT TRAIN..., replay memory = 394
DID NOT TRAIN..., replay memory = 395
DID NOT TRAIN..., replay memory = 396
DID NOT TRAIN..., replay memory = 397
DID NOT TRAIN..., replay memory = 398
DID NOT TRAIN..., replay memory = 399
DID NOT TRAIN..., replay memory = 400
Training Episode 2 finished! Return: -1569.04
Training Episode 3
DID NOT TRAIN..., replay memory = 401
DID NOT TRAIN..., replay memory = 402
DID NOT TRAIN..., replay memory = 403
DID NOT TRAIN..., replay memory = 404
DID NOT TRAIN..., replay memory = 405
DID NOT TRAIN..., replay memory = 406
DID NOT TRAIN..., replay memory = 407
DID NOT TRAIN..., replay memory = 408
DID NOT TRAIN..., replay memory = 409
DID NOT TRAIN..., replay memory = 410
DID NOT TRAIN..., replay memory = 411
DID NOT TRAIN..., replay memory = 412
DID NOT TRAIN..., replay memory = 413
DID NOT TRAIN..., replay memory = 414
DID NOT TRAIN..., replay memory = 415
DID NOT TRAIN..., replay memory = 416
DID NOT TRAIN..., replay memory = 417
DID NOT TRAIN..., replay memory = 418
DID NOT TRAIN..., replay memory = 419
DID NOT TRAIN..., replay memory = 420
DID NOT TRAIN..., replay memory = 421
DID NOT TRAIN..., replay memory = 422
DID NOT TRAIN..., replay memory = 423
DID NOT TRAIN..., replay memory = 424
DID NOT TRAIN..., replay memory = 425
DID NOT TRAIN..., replay memory = 426
DID NOT TRAIN..., replay memory = 427
DID NOT TRAIN..., replay memory = 428
DID NOT TRAIN..., replay memory = 429
DID NOT TRAIN..., replay memory = 430
DID NOT TRAIN..., replay memory = 431
DID NOT TRAIN..., replay memory = 432
DID NOT TRAIN..., replay memory = 433
DID NOT TRAIN..., replay memory = 434
DID NOT TRAIN..., replay memory = 435
DID NOT TRAIN..., replay memory = 436
DID NOT TRAIN..., replay memory = 437
DID NOT TRAIN..., replay memory = 438
DID NOT TRAIN..., replay memory = 439
DID NOT TRAIN..., replay memory = 440
DID NOT TRAIN..., replay memory = 441
DID NOT TRAIN..., replay memory = 442
DID NOT TRAIN..., replay memory = 443
DID NOT TRAIN..., replay memory = 444
DID NOT TRAIN..., replay memory = 445
DID NOT TRAIN..., replay memory = 446
DID NOT TRAIN..., replay memory = 447
DID NOT TRAIN..., replay memory = 448
DID NOT TRAIN..., replay memory = 449
DID NOT TRAIN..., replay memory = 450
DID NOT TRAIN..., replay memory = 451
DID NOT TRAIN..., replay memory = 452
DID NOT TRAIN..., replay memory = 453
DID NOT TRAIN..., replay memory = 454
DID NOT TRAIN..., replay memory = 455
DID NOT TRAIN..., replay memory = 456
DID NOT TRAIN..., replay memory = 457
DID NOT TRAIN..., replay memory = 458
DID NOT TRAIN..., replay memory = 459
DID NOT TRAIN..., replay memory = 460
DID NOT TRAIN..., replay memory = 461
DID NOT TRAIN..., replay memory = 462
DID NOT TRAIN..., replay memory = 463
DID NOT TRAIN..., replay memory = 464
DID NOT TRAIN..., replay memory = 465
DID NOT TRAIN..., replay memory = 466
DID NOT TRAIN..., replay memory = 467
DID NOT TRAIN..., replay memory = 468
DID NOT TRAIN..., replay memory = 469
DID NOT TRAIN..., replay memory = 470
DID NOT TRAIN..., replay memory = 471
DID NOT TRAIN..., replay memory = 472
DID NOT TRAIN..., replay memory = 473
DID NOT TRAIN..., replay memory = 474
DID NOT TRAIN..., replay memory = 475
DID NOT TRAIN..., replay memory = 476
DID NOT TRAIN..., replay memory = 477
DID NOT TRAIN..., replay memory = 478
DID NOT TRAIN..., replay memory = 479
DID NOT TRAIN..., replay memory = 480
DID NOT TRAIN..., replay memory = 481
DID NOT TRAIN..., replay memory = 482
DID NOT TRAIN..., replay memory = 483
DID NOT TRAIN..., replay memory = 484
DID NOT TRAIN..., replay memory = 485
DID NOT TRAIN..., replay memory = 486
DID NOT TRAIN..., replay memory = 487
DID NOT TRAIN..., replay memory = 488
DID NOT TRAIN..., replay memory = 489
DID NOT TRAIN..., replay memory = 490
DID NOT TRAIN..., replay memory = 491
DID NOT TRAIN..., replay memory = 492
DID NOT TRAIN..., replay memory = 493
DID NOT TRAIN..., replay memory = 494
DID NOT TRAIN..., replay memory = 495
DID NOT TRAIN..., replay memory = 496
DID NOT TRAIN..., replay memory = 497
DID NOT TRAIN..., replay memory = 498
DID NOT TRAIN..., replay memory = 499
DID NOT TRAIN..., replay memory = 500
DID NOT TRAIN..., replay memory = 501
DID NOT TRAIN..., replay memory = 502
DID NOT TRAIN..., replay memory = 503
DID NOT TRAIN..., replay memory = 504
DID NOT TRAIN..., replay memory = 505
DID NOT TRAIN..., replay memory = 506
DID NOT TRAIN..., replay memory = 507
DID NOT TRAIN..., replay memory = 508
DID NOT TRAIN..., replay memory = 509
DID NOT TRAIN..., replay memory = 510
DID NOT TRAIN..., replay memory = 511
DID NOT TRAIN..., replay memory = 512
DID NOT TRAIN..., replay memory = 513
DID NOT TRAIN..., replay memory = 514
DID NOT TRAIN..., replay memory = 515
DID NOT TRAIN..., replay memory = 516
DID NOT TRAIN..., replay memory = 517
DID NOT TRAIN..., replay memory = 518
DID NOT TRAIN..., replay memory = 519
DID NOT TRAIN..., replay memory = 520
DID NOT TRAIN..., replay memory = 521
DID NOT TRAIN..., replay memory = 522
DID NOT TRAIN..., replay memory = 523
DID NOT TRAIN..., replay memory = 524
DID NOT TRAIN..., replay memory = 525
DID NOT TRAIN..., replay memory = 526
DID NOT TRAIN..., replay memory = 527
DID NOT TRAIN..., replay memory = 528
DID NOT TRAIN..., replay memory = 529
DID NOT TRAIN..., replay memory = 530
DID NOT TRAIN..., replay memory = 531
DID NOT TRAIN..., replay memory = 532
DID NOT TRAIN..., replay memory = 533
DID NOT TRAIN..., replay memory = 534
DID NOT TRAIN..., replay memory = 535
DID NOT TRAIN..., replay memory = 536
DID NOT TRAIN..., replay memory = 537
DID NOT TRAIN..., replay memory = 538
DID NOT TRAIN..., replay memory = 539
DID NOT TRAIN..., replay memory = 540
DID NOT TRAIN..., replay memory = 541
DID NOT TRAIN..., replay memory = 542
DID NOT TRAIN..., replay memory = 543
DID NOT TRAIN..., replay memory = 544
DID NOT TRAIN..., replay memory = 545
DID NOT TRAIN..., replay memory = 546
DID NOT TRAIN..., replay memory = 547
DID NOT TRAIN..., replay memory = 548
DID NOT TRAIN..., replay memory = 549
DID NOT TRAIN..., replay memory = 550
DID NOT TRAIN..., replay memory = 551
DID NOT TRAIN..., replay memory = 552
DID NOT TRAIN..., replay memory = 553
DID NOT TRAIN..., replay memory = 554
DID NOT TRAIN..., replay memory = 555
DID NOT TRAIN..., replay memory = 556
DID NOT TRAIN..., replay memory = 557
DID NOT TRAIN..., replay memory = 558
DID NOT TRAIN..., replay memory = 559
DID NOT TRAIN..., replay memory = 560
DID NOT TRAIN..., replay memory = 561
DID NOT TRAIN..., replay memory = 562
DID NOT TRAIN..., replay memory = 563
DID NOT TRAIN..., replay memory = 564
DID NOT TRAIN..., replay memory = 565
DID NOT TRAIN..., replay memory = 566
DID NOT TRAIN..., replay memory = 567
DID NOT TRAIN..., replay memory = 568
DID NOT TRAIN..., replay memory = 569
DID NOT TRAIN..., replay memory = 570
DID NOT TRAIN..., replay memory = 571
DID NOT TRAIN..., replay memory = 572
DID NOT TRAIN..., replay memory = 573
DID NOT TRAIN..., replay memory = 574
DID NOT TRAIN..., replay memory = 575
DID NOT TRAIN..., replay memory = 576
DID NOT TRAIN..., replay memory = 577
DID NOT TRAIN..., replay memory = 578
DID NOT TRAIN..., replay memory = 579
DID NOT TRAIN..., replay memory = 580
DID NOT TRAIN..., replay memory = 581
DID NOT TRAIN..., replay memory = 582
DID NOT TRAIN..., replay memory = 583
DID NOT TRAIN..., replay memory = 584
DID NOT TRAIN..., replay memory = 585
DID NOT TRAIN..., replay memory = 586
DID NOT TRAIN..., replay memory = 587
DID NOT TRAIN..., replay memory = 588
DID NOT TRAIN..., replay memory = 589
DID NOT TRAIN..., replay memory = 590
DID NOT TRAIN..., replay memory = 591
DID NOT TRAIN..., replay memory = 592
DID NOT TRAIN..., replay memory = 593
DID NOT TRAIN..., replay memory = 594
DID NOT TRAIN..., replay memory = 595
DID NOT TRAIN..., replay memory = 596
DID NOT TRAIN..., replay memory = 597
DID NOT TRAIN..., replay memory = 598
DID NOT TRAIN..., replay memory = 599
DID NOT TRAIN..., replay memory = 600
Training Episode 3 finished! Return: -1591.65
Training Episode 4
DID NOT TRAIN..., replay memory = 601
DID NOT TRAIN..., replay memory = 602
DID NOT TRAIN..., replay memory = 603
DID NOT TRAIN..., replay memory = 604
DID NOT TRAIN..., replay memory = 605
DID NOT TRAIN..., replay memory = 606
DID NOT TRAIN..., replay memory = 607
DID NOT TRAIN..., replay memory = 608
DID NOT TRAIN..., replay memory = 609
DID NOT TRAIN..., replay memory = 610
DID NOT TRAIN..., replay memory = 611
DID NOT TRAIN..., replay memory = 612
DID NOT TRAIN..., replay memory = 613
DID NOT TRAIN..., replay memory = 614
DID NOT TRAIN..., replay memory = 615
DID NOT TRAIN..., replay memory = 616
DID NOT TRAIN..., replay memory = 617
DID NOT TRAIN..., replay memory = 618
DID NOT TRAIN..., replay memory = 619
DID NOT TRAIN..., replay memory = 620
DID NOT TRAIN..., replay memory = 621
DID NOT TRAIN..., replay memory = 622
DID NOT TRAIN..., replay memory = 623
DID NOT TRAIN..., replay memory = 624
DID NOT TRAIN..., replay memory = 625
DID NOT TRAIN..., replay memory = 626
DID NOT TRAIN..., replay memory = 627
DID NOT TRAIN..., replay memory = 628
DID NOT TRAIN..., replay memory = 629
DID NOT TRAIN..., replay memory = 630
DID NOT TRAIN..., replay memory = 631
DID NOT TRAIN..., replay memory = 632
DID NOT TRAIN..., replay memory = 633
DID NOT TRAIN..., replay memory = 634
DID NOT TRAIN..., replay memory = 635
DID NOT TRAIN..., replay memory = 636
DID NOT TRAIN..., replay memory = 637
DID NOT TRAIN..., replay memory = 638
DID NOT TRAIN..., replay memory = 639
DID NOT TRAIN..., replay memory = 640
DID NOT TRAIN..., replay memory = 641
DID NOT TRAIN..., replay memory = 642
DID NOT TRAIN..., replay memory = 643
DID NOT TRAIN..., replay memory = 644
DID NOT TRAIN..., replay memory = 645
DID NOT TRAIN..., replay memory = 646
DID NOT TRAIN..., replay memory = 647
DID NOT TRAIN..., replay memory = 648
DID NOT TRAIN..., replay memory = 649
DID NOT TRAIN..., replay memory = 650
DID NOT TRAIN..., replay memory = 651
DID NOT TRAIN..., replay memory = 652
DID NOT TRAIN..., replay memory = 653
DID NOT TRAIN..., replay memory = 654
DID NOT TRAIN..., replay memory = 655
DID NOT TRAIN..., replay memory = 656
DID NOT TRAIN..., replay memory = 657
DID NOT TRAIN..., replay memory = 658
DID NOT TRAIN..., replay memory = 659
DID NOT TRAIN..., replay memory = 660
DID NOT TRAIN..., replay memory = 661
DID NOT TRAIN..., replay memory = 662
DID NOT TRAIN..., replay memory = 663
DID NOT TRAIN..., replay memory = 664
DID NOT TRAIN..., replay memory = 665
DID NOT TRAIN..., replay memory = 666
DID NOT TRAIN..., replay memory = 667
DID NOT TRAIN..., replay memory = 668
DID NOT TRAIN..., replay memory = 669
DID NOT TRAIN..., replay memory = 670
DID NOT TRAIN..., replay memory = 671
DID NOT TRAIN..., replay memory = 672
DID NOT TRAIN..., replay memory = 673
DID NOT TRAIN..., replay memory = 674
DID NOT TRAIN..., replay memory = 675
DID NOT TRAIN..., replay memory = 676
DID NOT TRAIN..., replay memory = 677
DID NOT TRAIN..., replay memory = 678
DID NOT TRAIN..., replay memory = 679
DID NOT TRAIN..., replay memory = 680
DID NOT TRAIN..., replay memory = 681
DID NOT TRAIN..., replay memory = 682
DID NOT TRAIN..., replay memory = 683
DID NOT TRAIN..., replay memory = 684
DID NOT TRAIN..., replay memory = 685
DID NOT TRAIN..., replay memory = 686
DID NOT TRAIN..., replay memory = 687
DID NOT TRAIN..., replay memory = 688
DID NOT TRAIN..., replay memory = 689
DID NOT TRAIN..., replay memory = 690
DID NOT TRAIN..., replay memory = 691
DID NOT TRAIN..., replay memory = 692
DID NOT TRAIN..., replay memory = 693
DID NOT TRAIN..., replay memory = 694
DID NOT TRAIN..., replay memory = 695
DID NOT TRAIN..., replay memory = 696
DID NOT TRAIN..., replay memory = 697
DID NOT TRAIN..., replay memory = 698
DID NOT TRAIN..., replay memory = 699
DID NOT TRAIN..., replay memory = 700
DID NOT TRAIN..., replay memory = 701
DID NOT TRAIN..., replay memory = 702
DID NOT TRAIN..., replay memory = 703
DID NOT TRAIN..., replay memory = 704
DID NOT TRAIN..., replay memory = 705
DID NOT TRAIN..., replay memory = 706
DID NOT TRAIN..., replay memory = 707
DID NOT TRAIN..., replay memory = 708
DID NOT TRAIN..., replay memory = 709
DID NOT TRAIN..., replay memory = 710
DID NOT TRAIN..., replay memory = 711
DID NOT TRAIN..., replay memory = 712
DID NOT TRAIN..., replay memory = 713
DID NOT TRAIN..., replay memory = 714
DID NOT TRAIN..., replay memory = 715
DID NOT TRAIN..., replay memory = 716
DID NOT TRAIN..., replay memory = 717
DID NOT TRAIN..., replay memory = 718
DID NOT TRAIN..., replay memory = 719
DID NOT TRAIN..., replay memory = 720
DID NOT TRAIN..., replay memory = 721
DID NOT TRAIN..., replay memory = 722
DID NOT TRAIN..., replay memory = 723
DID NOT TRAIN..., replay memory = 724
DID NOT TRAIN..., replay memory = 725
DID NOT TRAIN..., replay memory = 726
DID NOT TRAIN..., replay memory = 727
DID NOT TRAIN..., replay memory = 728
DID NOT TRAIN..., replay memory = 729
DID NOT TRAIN..., replay memory = 730
DID NOT TRAIN..., replay memory = 731
DID NOT TRAIN..., replay memory = 732
DID NOT TRAIN..., replay memory = 733
DID NOT TRAIN..., replay memory = 734
DID NOT TRAIN..., replay memory = 735
DID NOT TRAIN..., replay memory = 736
DID NOT TRAIN..., replay memory = 737
DID NOT TRAIN..., replay memory = 738
DID NOT TRAIN..., replay memory = 739
DID NOT TRAIN..., replay memory = 740
DID NOT TRAIN..., replay memory = 741
DID NOT TRAIN..., replay memory = 742
DID NOT TRAIN..., replay memory = 743
DID NOT TRAIN..., replay memory = 744
DID NOT TRAIN..., replay memory = 745
DID NOT TRAIN..., replay memory = 746
DID NOT TRAIN..., replay memory = 747
DID NOT TRAIN..., replay memory = 748
DID NOT TRAIN..., replay memory = 749
DID NOT TRAIN..., replay memory = 750
DID NOT TRAIN..., replay memory = 751
DID NOT TRAIN..., replay memory = 752
DID NOT TRAIN..., replay memory = 753
DID NOT TRAIN..., replay memory = 754
DID NOT TRAIN..., replay memory = 755
DID NOT TRAIN..., replay memory = 756
DID NOT TRAIN..., replay memory = 757
DID NOT TRAIN..., replay memory = 758
DID NOT TRAIN..., replay memory = 759
DID NOT TRAIN..., replay memory = 760
DID NOT TRAIN..., replay memory = 761
DID NOT TRAIN..., replay memory = 762
DID NOT TRAIN..., replay memory = 763
DID NOT TRAIN..., replay memory = 764
DID NOT TRAIN..., replay memory = 765
DID NOT TRAIN..., replay memory = 766
DID NOT TRAIN..., replay memory = 767
DID NOT TRAIN..., replay memory = 768
DID NOT TRAIN..., replay memory = 769
DID NOT TRAIN..., replay memory = 770
DID NOT TRAIN..., replay memory = 771
DID NOT TRAIN..., replay memory = 772
DID NOT TRAIN..., replay memory = 773
DID NOT TRAIN..., replay memory = 774
DID NOT TRAIN..., replay memory = 775
DID NOT TRAIN..., replay memory = 776
DID NOT TRAIN..., replay memory = 777
DID NOT TRAIN..., replay memory = 778
DID NOT TRAIN..., replay memory = 779
DID NOT TRAIN..., replay memory = 780
DID NOT TRAIN..., replay memory = 781
DID NOT TRAIN..., replay memory = 782
DID NOT TRAIN..., replay memory = 783
DID NOT TRAIN..., replay memory = 784
DID NOT TRAIN..., replay memory = 785
DID NOT TRAIN..., replay memory = 786
DID NOT TRAIN..., replay memory = 787
DID NOT TRAIN..., replay memory = 788
DID NOT TRAIN..., replay memory = 789
DID NOT TRAIN..., replay memory = 790
DID NOT TRAIN..., replay memory = 791
DID NOT TRAIN..., replay memory = 792
DID NOT TRAIN..., replay memory = 793
DID NOT TRAIN..., replay memory = 794
DID NOT TRAIN..., replay memory = 795
DID NOT TRAIN..., replay memory = 796
DID NOT TRAIN..., replay memory = 797
DID NOT TRAIN..., replay memory = 798
DID NOT TRAIN..., replay memory = 799
DID NOT TRAIN..., replay memory = 800
Training Episode 4 finished! Return: -1618.56
Training Episode 5
DID NOT TRAIN..., replay memory = 801
DID NOT TRAIN..., replay memory = 802
DID NOT TRAIN..., replay memory = 803
DID NOT TRAIN..., replay memory = 804
DID NOT TRAIN..., replay memory = 805
DID NOT TRAIN..., replay memory = 806
DID NOT TRAIN..., replay memory = 807
DID NOT TRAIN..., replay memory = 808
DID NOT TRAIN..., replay memory = 809
DID NOT TRAIN..., replay memory = 810
DID NOT TRAIN..., replay memory = 811
DID NOT TRAIN..., replay memory = 812
DID NOT TRAIN..., replay memory = 813
DID NOT TRAIN..., replay memory = 814
DID NOT TRAIN..., replay memory = 815
DID NOT TRAIN..., replay memory = 816
DID NOT TRAIN..., replay memory = 817
DID NOT TRAIN..., replay memory = 818
DID NOT TRAIN..., replay memory = 819
DID NOT TRAIN..., replay memory = 820
DID NOT TRAIN..., replay memory = 821
DID NOT TRAIN..., replay memory = 822
DID NOT TRAIN..., replay memory = 823
DID NOT TRAIN..., replay memory = 824
DID NOT TRAIN..., replay memory = 825
DID NOT TRAIN..., replay memory = 826
DID NOT TRAIN..., replay memory = 827
DID NOT TRAIN..., replay memory = 828
DID NOT TRAIN..., replay memory = 829
DID NOT TRAIN..., replay memory = 830
DID NOT TRAIN..., replay memory = 831
DID NOT TRAIN..., replay memory = 832
DID NOT TRAIN..., replay memory = 833
DID NOT TRAIN..., replay memory = 834
DID NOT TRAIN..., replay memory = 835
DID NOT TRAIN..., replay memory = 836
DID NOT TRAIN..., replay memory = 837
DID NOT TRAIN..., replay memory = 838
DID NOT TRAIN..., replay memory = 839
DID NOT TRAIN..., replay memory = 840
DID NOT TRAIN..., replay memory = 841
DID NOT TRAIN..., replay memory = 842
DID NOT TRAIN..., replay memory = 843
DID NOT TRAIN..., replay memory = 844
DID NOT TRAIN..., replay memory = 845
DID NOT TRAIN..., replay memory = 846
DID NOT TRAIN..., replay memory = 847
DID NOT TRAIN..., replay memory = 848
DID NOT TRAIN..., replay memory = 849
DID NOT TRAIN..., replay memory = 850
DID NOT TRAIN..., replay memory = 851
DID NOT TRAIN..., replay memory = 852
DID NOT TRAIN..., replay memory = 853
DID NOT TRAIN..., replay memory = 854
DID NOT TRAIN..., replay memory = 855
DID NOT TRAIN..., replay memory = 856
DID NOT TRAIN..., replay memory = 857
DID NOT TRAIN..., replay memory = 858
DID NOT TRAIN..., replay memory = 859
DID NOT TRAIN..., replay memory = 860
DID NOT TRAIN..., replay memory = 861
DID NOT TRAIN..., replay memory = 862
DID NOT TRAIN..., replay memory = 863
DID NOT TRAIN..., replay memory = 864
DID NOT TRAIN..., replay memory = 865
DID NOT TRAIN..., replay memory = 866
DID NOT TRAIN..., replay memory = 867
DID NOT TRAIN..., replay memory = 868
DID NOT TRAIN..., replay memory = 869
DID NOT TRAIN..., replay memory = 870
DID NOT TRAIN..., replay memory = 871
DID NOT TRAIN..., replay memory = 872
DID NOT TRAIN..., replay memory = 873
DID NOT TRAIN..., replay memory = 874
DID NOT TRAIN..., replay memory = 875
DID NOT TRAIN..., replay memory = 876
DID NOT TRAIN..., replay memory = 877
DID NOT TRAIN..., replay memory = 878
DID NOT TRAIN..., replay memory = 879
DID NOT TRAIN..., replay memory = 880
DID NOT TRAIN..., replay memory = 881
DID NOT TRAIN..., replay memory = 882
DID NOT TRAIN..., replay memory = 883
DID NOT TRAIN..., replay memory = 884
DID NOT TRAIN..., replay memory = 885
DID NOT TRAIN..., replay memory = 886
DID NOT TRAIN..., replay memory = 887
DID NOT TRAIN..., replay memory = 888
DID NOT TRAIN..., replay memory = 889
DID NOT TRAIN..., replay memory = 890
DID NOT TRAIN..., replay memory = 891
DID NOT TRAIN..., replay memory = 892
DID NOT TRAIN..., replay memory = 893
DID NOT TRAIN..., replay memory = 894
DID NOT TRAIN..., replay memory = 895
DID NOT TRAIN..., replay memory = 896
DID NOT TRAIN..., replay memory = 897
DID NOT TRAIN..., replay memory = 898
DID NOT TRAIN..., replay memory = 899
DID NOT TRAIN..., replay memory = 900
DID NOT TRAIN..., replay memory = 901
DID NOT TRAIN..., replay memory = 902
DID NOT TRAIN..., replay memory = 903
DID NOT TRAIN..., replay memory = 904
DID NOT TRAIN..., replay memory = 905
DID NOT TRAIN..., replay memory = 906
DID NOT TRAIN..., replay memory = 907
DID NOT TRAIN..., replay memory = 908
DID NOT TRAIN..., replay memory = 909
DID NOT TRAIN..., replay memory = 910
DID NOT TRAIN..., replay memory = 911
DID NOT TRAIN..., replay memory = 912
DID NOT TRAIN..., replay memory = 913
DID NOT TRAIN..., replay memory = 914
DID NOT TRAIN..., replay memory = 915
DID NOT TRAIN..., replay memory = 916
DID NOT TRAIN..., replay memory = 917
DID NOT TRAIN..., replay memory = 918
DID NOT TRAIN..., replay memory = 919
DID NOT TRAIN..., replay memory = 920
DID NOT TRAIN..., replay memory = 921
DID NOT TRAIN..., replay memory = 922
DID NOT TRAIN..., replay memory = 923
DID NOT TRAIN..., replay memory = 924
DID NOT TRAIN..., replay memory = 925
DID NOT TRAIN..., replay memory = 926
DID NOT TRAIN..., replay memory = 927
DID NOT TRAIN..., replay memory = 928
DID NOT TRAIN..., replay memory = 929
DID NOT TRAIN..., replay memory = 930
DID NOT TRAIN..., replay memory = 931
DID NOT TRAIN..., replay memory = 932
DID NOT TRAIN..., replay memory = 933
DID NOT TRAIN..., replay memory = 934
DID NOT TRAIN..., replay memory = 935
DID NOT TRAIN..., replay memory = 936
DID NOT TRAIN..., replay memory = 937
DID NOT TRAIN..., replay memory = 938
DID NOT TRAIN..., replay memory = 939
DID NOT TRAIN..., replay memory = 940
DID NOT TRAIN..., replay memory = 941
DID NOT TRAIN..., replay memory = 942
DID NOT TRAIN..., replay memory = 943
DID NOT TRAIN..., replay memory = 944
DID NOT TRAIN..., replay memory = 945
DID NOT TRAIN..., replay memory = 946
DID NOT TRAIN..., replay memory = 947
DID NOT TRAIN..., replay memory = 948
DID NOT TRAIN..., replay memory = 949
DID NOT TRAIN..., replay memory = 950
DID NOT TRAIN..., replay memory = 951
DID NOT TRAIN..., replay memory = 952
DID NOT TRAIN..., replay memory = 953
DID NOT TRAIN..., replay memory = 954
DID NOT TRAIN..., replay memory = 955
DID NOT TRAIN..., replay memory = 956
DID NOT TRAIN..., replay memory = 957
DID NOT TRAIN..., replay memory = 958
DID NOT TRAIN..., replay memory = 959
DID NOT TRAIN..., replay memory = 960
DID NOT TRAIN..., replay memory = 961
DID NOT TRAIN..., replay memory = 962
DID NOT TRAIN..., replay memory = 963
DID NOT TRAIN..., replay memory = 964
DID NOT TRAIN..., replay memory = 965
DID NOT TRAIN..., replay memory = 966
DID NOT TRAIN..., replay memory = 967
DID NOT TRAIN..., replay memory = 968
DID NOT TRAIN..., replay memory = 969
DID NOT TRAIN..., replay memory = 970
DID NOT TRAIN..., replay memory = 971
DID NOT TRAIN..., replay memory = 972
DID NOT TRAIN..., replay memory = 973
DID NOT TRAIN..., replay memory = 974
DID NOT TRAIN..., replay memory = 975
DID NOT TRAIN..., replay memory = 976
DID NOT TRAIN..., replay memory = 977
DID NOT TRAIN..., replay memory = 978
DID NOT TRAIN..., replay memory = 979
DID NOT TRAIN..., replay memory = 980
DID NOT TRAIN..., replay memory = 981
DID NOT TRAIN..., replay memory = 982
DID NOT TRAIN..., replay memory = 983
DID NOT TRAIN..., replay memory = 984
DID NOT TRAIN..., replay memory = 985
DID NOT TRAIN..., replay memory = 986
DID NOT TRAIN..., replay memory = 987
DID NOT TRAIN..., replay memory = 988
DID NOT TRAIN..., replay memory = 989
DID NOT TRAIN..., replay memory = 990
DID NOT TRAIN..., replay memory = 991
DID NOT TRAIN..., replay memory = 992
DID NOT TRAIN..., replay memory = 993
DID NOT TRAIN..., replay memory = 994
DID NOT TRAIN..., replay memory = 995
DID NOT TRAIN..., replay memory = 996
DID NOT TRAIN..., replay memory = 997
DID NOT TRAIN..., replay memory = 998
DID NOT TRAIN..., replay memory = 999
Training Episode 5 finished! Return: -1614.00
Training Episode 6
Training Episode 6 finished! Return: -1379.36
Training Episode 7
Training Episode 7 finished! Return: -1724.36
Training Episode 8
Training Episode 8 finished! Return: -1347.38
Training Episode 9
Training Episode 9 finished! Return: -1575.49
Training Episode 10
Training Episode 10 finished! Return: -1405.39
Average training reward over last 10 episodes: -1476.16
Training Episode 11
Training Episode 11 finished! Return: -1115.74
Training Episode 12
Training Episode 12 finished! Return: -1337.94
Training Episode 13
Training Episode 13 finished! Return: -1539.58
Training Episode 14
Training Episode 14 finished! Return: -804.25
Training Episode 15
Training Episode 15 finished! Return: -1561.63
Training Episode 16
Training Episode 16 finished! Return: -983.77
Training Episode 17
Training Episode 17 finished! Return: -980.49
Training Episode 18
Training Episode 18 finished! Return: -1038.43
Training Episode 19
Training Episode 19 finished! Return: -901.08
Training Episode 20
Training Episode 20 finished! Return: -1161.31
Average training reward over last 10 episodes: -1142.42
Training Episode 21
Training Episode 21 finished! Return: -1202.87
Training Episode 22
Training Episode 22 finished! Return: -1094.92
Training Episode 23
Training Episode 23 finished! Return: -1152.79
Training Episode 24
Training Episode 24 finished! Return: -1439.27
Training Episode 25
Training Episode 25 finished! Return: -1187.26
Training Episode 26
Training Episode 26 finished! Return: -1192.24
Training Episode 27
Training Episode 27 finished! Return: -1224.08
Training Episode 28
Training Episode 28 finished! Return: -1343.43
Training Episode 29
Training Episode 29 finished! Return: -1297.47
Training Episode 30
Training Episode 30 finished! Return: -1425.91
Average training reward over last 10 episodes: -1256.02
Training Episode 31
Training Episode 31 finished! Return: -1340.77
Training Episode 32
Training Episode 32 finished! Return: -1235.98
Training Episode 33
Training Episode 33 finished! Return: -1094.69
Training Episode 34
Training Episode 34 finished! Return: -1348.90
Training Episode 35
Training Episode 35 finished! Return: -1189.47
Training Episode 36
Training Episode 36 finished! Return: -1356.94
Training Episode 37
Training Episode 37 finished! Return: -1205.53
Training Episode 38
Training Episode 38 finished! Return: -1178.37
Training Episode 39
Training Episode 39 finished! Return: -1026.52
Training Episode 40
Training Episode 40 finished! Return: -933.36
Average training reward over last 10 episodes: -1191.05
Training Episode 41
Training Episode 41 finished! Return: -905.85
Training Episode 42
Training Episode 42 finished! Return: -1042.72
Training Episode 43
Training Episode 43 finished! Return: -1151.62
Training Episode 44
Training Episode 44 finished! Return: -961.30
Training Episode 45
Training Episode 45 finished! Return: -772.12
Training Episode 46
Training Episode 46 finished! Return: -1355.70
Training Episode 47
Training Episode 47 finished! Return: -1142.58
Training Episode 48
Training Episode 48 finished! Return: -1041.63
Training Episode 49
Training Episode 49 finished! Return: -1149.37
Training Episode 50
Training Episode 50 finished! Return: -904.52
Average training reward over last 10 episodes: -1042.74
Training Episode 51
Training Episode 51 finished! Return: -1065.34
Training Episode 52
Training Episode 52 finished! Return: -910.24
Training Episode 53
Training Episode 53 finished! Return: -926.86
Training Episode 54
Training Episode 54 finished! Return: -1287.63
Training Episode 55
Training Episode 55 finished! Return: -886.44
Training Episode 56
Training Episode 56 finished! Return: -734.31
Training Episode 57
Training Episode 57 finished! Return: -981.67
Training Episode 58
Training Episode 58 finished! Return: -643.39
Training Episode 59
Training Episode 59 finished! Return: -613.61
Training Episode 60
Training Episode 60 finished! Return: -754.87
Average training reward over last 10 episodes: -880.44
Training Episode 61
Training Episode 61 finished! Return: -521.64
Training Episode 62
Training Episode 62 finished! Return: -883.59
Training Episode 63
Training Episode 63 finished! Return: -1112.11
Training Episode 64
Training Episode 64 finished! Return: -621.69
Training Episode 65
Training Episode 65 finished! Return: -920.78
Training Episode 66
Training Episode 66 finished! Return: -512.67
Training Episode 67
Training Episode 67 finished! Return: -532.41
Training Episode 68
Training Episode 68 finished! Return: -1001.03
Training Episode 69
Training Episode 69 finished! Return: -254.74
Training Episode 70
Training Episode 70 finished! Return: -257.24
Average training reward over last 10 episodes: -661.79
Training Episode 71
Training Episode 71 finished! Return: -124.81
Training Episode 72
Training Episode 72 finished! Return: -1462.87
Training Episode 73
Training Episode 73 finished! Return: -586.48
Training Episode 74
Training Episode 74 finished! Return: -726.08
Training Episode 75
Training Episode 75 finished! Return: -376.33
Training Episode 76
Training Episode 76 finished! Return: -378.55
Training Episode 77
Training Episode 77 finished! Return: -617.64
Training Episode 78
Training Episode 78 finished! Return: -627.29
Training Episode 79
Training Episode 79 finished! Return: -606.40
Training Episode 80
Training Episode 80 finished! Return: -125.21
Average training reward over last 10 episodes: -563.17
Training Episode 81
Training Episode 81 finished! Return: -489.49
Training Episode 82
Training Episode 82 finished! Return: -258.25
Training Episode 83
Training Episode 83 finished! Return: -242.51
Training Episode 84
Training Episode 84 finished! Return: -525.97
Training Episode 85
Training Episode 85 finished! Return: -1026.58
Training Episode 86
Training Episode 86 finished! Return: -242.15
Training Episode 87
Training Episode 87 finished! Return: -125.96
Training Episode 88
Training Episode 88 finished! Return: -371.61
Training Episode 89
Training Episode 89 finished! Return: -1.10
Training Episode 90
Training Episode 90 finished! Return: -247.00
Average training reward over last 10 episodes: -353.06
Training Episode 91
Training Episode 91 finished! Return: -382.53
Training Episode 92
Training Episode 92 finished! Return: -909.70
Training Episode 93
Training Episode 93 finished! Return: -5.58
Training Episode 94
Training Episode 94 finished! Return: -781.60
Training Episode 95
Training Episode 95 finished! Return: -118.92
Training Episode 96
Training Episode 96 finished! Return: -599.34
Training Episode 97
Training Episode 97 finished! Return: -128.16
Training Episode 98
Training Episode 98 finished! Return: -497.51
Training Episode 99
Training Episode 99 finished! Return: -121.65
Training Episode 100
Training Episode 100 finished! Return: -126.10
Average training reward over last 10 episodes: -367.11
Training Episode 101
Training Episode 101 finished! Return: -364.61
Training Episode 102
Training Episode 102 finished! Return: -375.18
Training Episode 103
Training Episode 103 finished! Return: -126.07
Training Episode 104
Training Episode 104 finished! Return: -1047.38
Training Episode 105
Training Episode 105 finished! Return: -125.18
Training Episode 106
Training Episode 106 finished! Return: -513.01
Training Episode 107
Training Episode 107 finished! Return: -436.61
Training Episode 108
Training Episode 108 finished! Return: -123.61
Training Episode 109
Training Episode 109 finished! Return: -382.80
Training Episode 110
Training Episode 110 finished! Return: -911.98
Average training reward over last 10 episodes: -440.64
Training Episode 111
Training Episode 111 finished! Return: -507.09
Training Episode 112
Training Episode 112 finished! Return: -248.01
Training Episode 113
Training Episode 113 finished! Return: -2.24
Training Episode 114
Training Episode 114 finished! Return: -127.15
Training Episode 115
Training Episode 115 finished! Return: -496.06
Training Episode 116
Training Episode 116 finished! Return: -362.84
Training Episode 117
Training Episode 117 finished! Return: -374.40
Training Episode 118
Training Episode 118 finished! Return: -236.18
Training Episode 119
Training Episode 119 finished! Return: -122.33
Training Episode 120
Training Episode 120 finished! Return: -790.19
Average training reward over last 10 episodes: -326.65
Training Episode 121
Training Episode 121 finished! Return: -362.58
Training Episode 122
Training Episode 122 finished! Return: -360.29
Training Episode 123
Training Episode 123 finished! Return: -117.22
Training Episode 124
Training Episode 124 finished! Return: -238.55
Training Episode 125
Training Episode 125 finished! Return: -0.84
Training Episode 126
Training Episode 126 finished! Return: -2.32
Training Episode 127
Training Episode 127 finished! Return: -242.54
Training Episode 128
Training Episode 128 finished! Return: -125.05
Training Episode 129
Training Episode 129 finished! Return: -123.11
Training Episode 130
Training Episode 130 finished! Return: -124.56
Average training reward over last 10 episodes: -169.71
Training Episode 131
Training Episode 131 finished! Return: -374.59
Training Episode 132
Training Episode 132 finished! Return: -251.72
Training Episode 133
Training Episode 133 finished! Return: -239.97
Training Episode 134
Training Episode 134 finished! Return: -127.34
Training Episode 135
Training Episode 135 finished! Return: -120.55
Training Episode 136
Training Episode 136 finished! Return: -250.48
Training Episode 137
Training Episode 137 finished! Return: -734.39
Training Episode 138
Training Episode 138 finished! Return: -251.37
Training Episode 139
Training Episode 139 finished! Return: -514.54
Training Episode 140
Training Episode 140 finished! Return: -126.75
Average training reward over last 10 episodes: -299.17
Training Episode 141
Training Episode 141 finished! Return: -235.84
Training Episode 142
Training Episode 142 finished! Return: -469.74
Training Episode 143
Training Episode 143 finished! Return: -121.83
Training Episode 144
Training Episode 144 finished! Return: -252.01
Training Episode 145
Training Episode 145 finished! Return: -253.69
Training Episode 146
Training Episode 146 finished! Return: -241.64
Training Episode 147
Training Episode 147 finished! Return: -245.61
Training Episode 148
Training Episode 148 finished! Return: -0.83
Training Episode 149
Training Episode 149 finished! Return: -125.18
Training Episode 150
Training Episode 150 finished! Return: -261.89
Average training reward over last 10 episodes: -220.82
Training completed in: 858.70 seconds

Final training reward over last 10 episodes: -220.82

=== EVALUATION PHASE (Epsilon=0, No Training) ===
Running 10 evaluation episodes...
Evaluation Episode 1 finished! Return: -250.53
Evaluation Episode 2 finished! Return: -248.76
Evaluation Episode 3 finished! Return: -340.44
Evaluation Episode 4 finished! Return: -121.00
Evaluation Episode 5 finished! Return: -121.46
Evaluation Episode 6 finished! Return: -118.74
Evaluation Episode 7 finished! Return: -247.79
Evaluation Episode 8 finished! Return: -119.82
Evaluation Episode 9 finished! Return: -118.79
Evaluation Episode 10 finished! Return: -120.70

=== EVALUATION RESULTS ===
Mean Reward: -180.80 ± 78.47
Min Reward:  -340.44
Max Reward:  -118.74
All Rewards: ['-250.5', '-248.8', '-340.4', '-121.0', '-121.5', '-118.7', '-247.8', '-119.8', '-118.8', '-120.7']

=== TRAINING vs EVALUATION COMPARISON ===
Final Training Performance (with epsilon=0.1): -220.82
Evaluation Performance (with epsilon=0.0):      -180.80
gent performs 40.02 points BETTER without exploration!

Total time (training + evaluation): 863.65 seconds

Observations and analysis

  • With epsilon = 0 for evaluation, performance significantly improves (+40.02 points) because the agent uses only its learned policy without random disruptions

  • The agent shows run-to-run variance typical of DQN training, with this run achieving better overall performance (-180.80) compared to the previous attempt (-212.00)

  • Despite the improvement, the agent's policy remains inconsistent, sometimes performing well (-118.7) but sometimes poorly (-340.4), with substantial variance (±78.47) suggesting training stability issues

  • The performance range (-340.4 to -118.7) spans 221.7 points, indicating the learned policy hasn't fully converged to a stable solution

While the improved teacher's code showed that epsilon=0 improves performance (+40.02 points), the substantial variance (±78.47) and inconsistent results (-340.4 to -118.7) suggest that the fixed choice of 40 discrete actions may not be optimal. The run-to-run performance differences further highlight the need for systematic exploration of action space discretizations to find more stable and consistent configurations.

Limitation of practical codes Teacher's approach limitations:

  • Fixed N_ACTIONS = 40 (arbitrary choice)
  • Manual action conversion formulas (error-prone)
  • No systematic comparison of action granularities
  • High performance variance suggests suboptimal discretization

Comprehensive plan ¶

Step 1: Action Space Discretization Exploration

  • Implement systematic comparison of different action space discretizations (5, 11, 21 actions)
  • Use identical DQN architecture and hyperparameters across all experiments to ensure fair comparison
  • Train and evaluate each configuration, generate learning curves, and analyze performance trends
  • Identify optimal discretization level based on final performance and learning stability

Step 2: Training Protocol Optimization

  • Systematically test different training lengths (200, 400, 600 episodes)
  • Evaluate learning stability, convergence patterns, and computational efficiency
  • Determine optimal episode count that balances training time with performance gains
  • Establish robust training protocol for subsequent experiments

Step 3: Exploration Strategy Optimization

  • Compare epsilon decay strategies (linear decay, exponential decay, plateau restart)
  • Analyze exploration-exploitation trade-offs across different learning phases
  • Evaluate impact on learning curves, final performance, and training stability
  • Select exploration strategy that maximizes learning efficiency

Step 4: Memory Architecture Optimization

  • Systematically test replay memory configurations with varying buffer sizes and minimum thresholds
  • Analyze memory utilization patterns, training stability, and evaluation consistency
  • Investigate impact of memory architecture on sample efficiency and performance
  • Establish optimal memory configuration for experience replay

Step 5: Learning Hyperparameter Optimization

  • Conduct systematic hyperparameter tuning using one-factor-at-a-time methodology:
    • Learning rate optimization (testing 1e-4, 3e-4, 1e-3)
    • Batch size tuning (comparing 32, 64, 128)
    • Discount factor analysis (evaluating 0.95, 0.99, 0.995)
    • Target network update frequency (testing 5, 10, 20 episodes)
  • Apply statistical validation with multiple evaluation runs and confidence interval analysis
  • Optimize neural network learning dynamics for maximum performance

Step 6: Network Architecture Exploration

  • Test different neural network architectures using optimized hyperparameters
  • Compare network depths (2 vs 3 hidden layers) and widths (64 vs 128 neurons)
  • Evaluate activation functions (ReLU vs Leaky ReLU vs ELU) for optimal function approximation
  • Fine-tune network structure to maximize learning capacity and stability

Step 7: Final Integration and Comprehensive Validation

  • Integrate all optimized components into final DQN configuration
  • Conduct extensive evaluation with multiple independent runs (10 runs × 50 episodes)
  • Generate comprehensive performance visualizations and statistical analysis
  • Validate complete optimized system against baseline performance

Step 8: Comparative Analysis and Documentation

  • Compare final optimized configuration against original baseline and intermediate configurations
  • Document systematic optimization methodology and quantify improvements at each phase
  • Analyze contribution of each optimization step to overall performance enhancement
  • Provide justification for final configuration choices based on empirical evidence

We further improved upon the practical codes and made it 'baseline'¶

What makes it an optimized baseline model ¶

Network Architecture:

  • Input: 3 nodes (cosθ, sinθ, θ_dot)

  • Hidden Layers: 2 layers, each with 64 neurons (balanced capacity)

  • Output: N Q-values (where N is determined through systematic exploration)

Optimized Discretization:

  • Action count determined empirically through comparison of [5, 11, 21, 50] discrete actions

  • Torque range: [-2, 2] maintained across all experiments

  • Selection based on performance data rather than arbitrary choice

Hyperparameters:

  • Batch size: 64 (proven effective)

  • Replay memory: 50,000 (sufficient for stable learning)

  • γ (discount): 0.99 (standard for continuous control)

  • Learning rate: 3e-4 (reliable default)

  • ε-greedy exploration: starts at 1.0, decays to 0.05 with 0.995 decay rate

  • Target update every 5 episodes (stability-performance balance)

Standard DQN Core Features:

  • Experience replay buffer

  • Separate target network

  • ε-greedy exploration strategy

  • Consistent training methodology across all configurations

Methodology Advantages:

  • Data-driven baseline selection rather than arbitrary parameter choice

  • Reproducible results through fixed random seeds

  • Systematic comparison framework for fair evaluation

  • Comprehensive performance documentation (learning curves, evaluation metrics, visual demonstrations)

NOT included: Advanced algorithmic modifications (Double DQN, Dueling DQN, Prioritized Replay, etc.) - these will be evaluated against the optimized baseline¶


The model below vs the model from practical

  1. Code Structure
  • In practical, there is only one single DQN class handling both model and agent logic
  • For the codes below, there is a separate DQN model class and DQNAgent class
  1. Action Space Handling
  • The practical codes does not allow the range of torque to reach +2 but the codes below uses 'np.linspace' which is much more precise as it covers every number from -2 to 2
  1. Model Architecture
  • The practical yses the normal Keras Functional API and separate models for main and target networks
  • I used Keras Model subclassing and has cleaner more pythonic model definition
  1. Training Loop & Epsilon Decay
  • The practical has a fixed epsilon with no learning improvement over timee
  • We used a proper epsilon decay (1.0 to 0.05)
  1. Experimental Design
  • The practical has a single experiment with fixed parameters
  • The codes below has multiple experiments with different action spaces and gives a comprehensive evaluation with 10 test episodes.

In [31]:
# Remove the default parameter since it references undefined N_ACTIONS
def action_index_to_torque(a_idx, n_actions):  #  Remove =N_ACTIONS
    """Convert discrete action index to torque value for Pendulum-v0."""
    torque = np.linspace(-2.0, 2.0, n_actions)[a_idx]
    return np.array([torque])

def torque_to_action_index(torque, n_actions):  # Remove =N_ACTIONS
    torque_bins = np.linspace(-2.0, 2.0, n_actions)
    idx = np.argmin(np.abs(torque_bins - torque))
    return idx
In [32]:
# DQN Model 
class DQN(Model):
    def __init__(self, input_shape, n_actions, hidden_size=64):
        super().__init__()
        self.d1 = Dense(hidden_size, activation='relu')
        self.d2 = Dense(hidden_size, activation='relu')
        self.out = Dense(n_actions, activation='linear')
        # Build model by calling once
        self(np.zeros((1, input_shape)))

    def call(self, x):
        x = self.d1(x)
        x = self.d2(x)
        return self.out(x)
In [6]:
# Agent 
class DQNAgent:
    def __init__(self, input_shape, n_actions, gamma, replay_memory_size, min_replay_memory, batch_size, target_update_every, learning_rate, epsilon_start, epsilon_min, epsilon_decay):
        self.input_shape = input_shape
        self.n_actions = n_actions
        self.gamma = gamma
        self.memory = deque(maxlen=replay_memory_size)
        self.min_replay_memory = min_replay_memory
        self.batch_size = batch_size
        self.target_update_every = target_update_every
        self.epsilon = epsilon_start
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        self.model = DQN(input_shape, n_actions)
        self.target = DQN(input_shape, n_actions)
        self.target.set_weights(self.model.get_weights())
        self.optimizer = Adam(learning_rate=learning_rate)
        self.steps = 0

    def summary(self):
        print("\nModel Summary:")
        self.model.summary()

    def select_action(self, state):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        q_values = self.model(state[np.newaxis, :]).numpy()
        return np.argmax(q_values)

    def remember(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def train_step(self):
        if len(self.memory) < self.min_replay_memory:
            return
        batch = random.sample(self.memory, self.batch_size)
        s, a, r, s_next, done = zip(*batch)
        s = np.array(s)
        s_next = np.array(s_next)
        r = np.array(r, dtype=np.float32)
        done = np.array(done, dtype=bool)
        target_q = self.target(s_next).numpy()
        max_q_next = np.max(target_q, axis=1)
        target = self.model(s).numpy()
        for i in range(self.batch_size):
            if done[i]:
                target[i, a[i]] = r[i]
            else:
                target[i, a[i]] = r[i] + self.gamma * max_q_next[i]
        with tf.GradientTape() as tape:
            q_pred = self.model(s)
            loss = tf.reduce_mean((target - q_pred) ** 2)
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))

    def update_target(self):
        self.target.set_weights(self.model.get_weights())

    def decay_epsilon(self):
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def save(self, path):
        self.model.save_weights(path)

    def load(self, path):
        self.model.load_weights(path)
        self.target.set_weights(self.model.get_weights())
In [7]:
def train_and_evaluate(n_actions, experiment_prefix, RENDER_EVERY=20, record_gifs=True):
    ENV_NAME = 'Pendulum-v0'
    INPUT_SHAPE = 3
    GAMMA = 0.99
    REPLAY_MEMORY_SIZE = 50000
    MIN_REPLAY_MEMORY = 1000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    N_EPISODES = 200
    MAX_STEPS = 200

    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"
    TRAIN_PLOT_PATH = f"{experiment_prefix}_training_plot.png"
    EPISODE_TIMES_PATH = f"{experiment_prefix}_episode_times.png"
    EVAL_RETURNS_PATH = f"{experiment_prefix}_eval_returns.png"

    env = gym.make(ENV_NAME)
    agent = DQNAgent(INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, EPSILON_MIN, EPSILON_DECAY)
    agent.summary()
    scores = []
    best_avg_reward = -np.inf
    episode_times = []
    start = time.time()

    for ep in range(1, N_EPISODES+1):
        ep_start = time.time()
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]
        total_reward = 0

        for t in range(MAX_STEPS):
            if ep % RENDER_EVERY == 0:
                env.render()  # Live animation every RENDER_EVERY episodes

            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, n_actions)
            s_next, r, done, info = env.step(torque)
            s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            agent.remember(s, a_idx, r, s_next, done)
            agent.train_step()
            s = s_next
            total_reward += r
            if done:
                break

        agent.decay_epsilon()
        if ep % TARGET_UPDATE_EVERY == 0:
            agent.update_target()

        if ep in [50, 100, 150, 200]:
            agent.save(f"{experiment_prefix}_{ep}_weights.h5")
        
        scores.append(total_reward)
        avg_reward = np.mean(scores[-10:])
        ep_time = time.time() - ep_start
        episode_times.append(ep_time)
        print(f"Episode {ep} | Total Reward: {total_reward:.2f} | Avg(10): {avg_reward:.2f} | Epsilon: {agent.epsilon:.3f} | Time: {ep_time:.2f}s")

        if avg_reward > best_avg_reward:
            best_avg_reward = avg_reward
            agent.save(SAVE_WEIGHTS_PATH)

    env.close()
    total_time = time.time() - start

    # Plot rewards
    plt.figure(figsize=(10, 5))
    plt.plot(scores, label='Total Reward per Episode')
    plt.plot([np.mean(scores[max(0, i-9):i+1]) for i in range(len(scores))], label='Moving Avg (10)')
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.title(f'DQN Training on Pendulum-v0 ({n_actions} actions)')
    plt.legend()
    plt.savefig(TRAIN_PLOT_PATH)
    plt.close()

    # Plot episode times
    plt.figure()
    plt.plot(episode_times)
    plt.xlabel('Episode')
    plt.ylabel('Time (s)')
    plt.title('Time per Episode')
    plt.savefig(EPISODE_TIMES_PATH)
    plt.close()

    print(f"Best average reward over 10 episodes: {best_avg_reward:.2f}")
    print("Best model weights saved to:", SAVE_WEIGHTS_PATH)
    print(f"Total training time: {total_time:.2f}s")

    # --- Evaluation ---
    env = gym.make(ENV_NAME)
    agent.load(SAVE_WEIGHTS_PATH)
    rewards = []
    episode_states = []  # <-- FIX: initialize before loop
    for ep in range(10):
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]
        total_reward = 0
        states = []
        for t in range(MAX_STEPS):
            states.append(s)
            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, n_actions)
            s_next, r, done, info = env.step(torque)
            s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            total_reward += r
            s = s_next
            if done:
                break
        rewards.append(total_reward)
        episode_states.append(states)
        print(f"Test Episode {ep+1}: Total Reward = {total_reward:.2f}")
    env.close()
    print(f"\nAverage Reward over 10 episodes: {np.mean(rewards):.2f} ± {np.std(rewards):.2f}")

    # Record GIFs for best, worst, average
    if record_gifs:
        best_idx = np.argmax(rewards)
        worst_idx = np.argmin(rewards)
        avg_idx = np.argmin(np.abs(np.array(rewards) - np.mean(rewards)))
        for label, idx in zip(['best', 'worst', 'average'], [best_idx, worst_idx, avg_idx]):
            frames = []
            env_gif = gym.make(ENV_NAME)  # New env for clean replay
            s = env_gif.reset()
            s = s if isinstance(s, np.ndarray) else s[0]
            for state in episode_states[idx]:
                frame = env_gif.render(mode='rgb_array')
                frames.append(frame)
                a_idx = agent.select_action(state)
                torque = action_index_to_torque(a_idx, n_actions)
                s_next, _, _, _ = env_gif.step(torque)
                s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
                s = s_next
            imageio.mimsave(f"{experiment_prefix}_eval_{label}.gif", frames, fps=30)
            print(f"Saved {label} episode GIF to {experiment_prefix}_eval_{label}.gif")
            env_gif.close()

    # Plot episode returns
    plt.figure()
    plt.hist(rewards, bins=10)
    plt.title(f'Episode Returns ({n_actions} actions)')
    plt.xlabel('Total Reward')
    plt.ylabel('Count')
    plt.savefig(EVAL_RETURNS_PATH)
    plt.close() 
In [22]:
if __name__ == "__main__":
    # Set seeds for reproducibility
    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    tf.random.set_seed(SEED)
    
    for n_actions in [5, 11, 21, 50]:  
        experiment_prefix = f"dqn_pendulum_{n_actions}actions"
        print("="*60)
        print(f"Running experiment with N_ACTIONS = {n_actions}")
        train_and_evaluate(n_actions, experiment_prefix, RENDER_EVERY=20, record_gifs=True)
        print("="*60)
============================================================
Running experiment with N_ACTIONS = 5

Model Summary:
Model: "dqn"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               multiple                  256       
                                                                 
 dense_1 (Dense)             multiple                  4160      
                                                                 
 dense_2 (Dense)             multiple                  325       
                                                                 
=================================================================
Total params: 4741 (18.52 KB)
Trainable params: 4741 (18.52 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Episode 1 | Total Reward: -1589.35 | Avg(10): -1589.35 | Epsilon: 0.995 | Time: 0.04s
Episode 2 | Total Reward: -926.98 | Avg(10): -1258.16 | Epsilon: 0.990 | Time: 0.04s
Episode 3 | Total Reward: -1578.82 | Avg(10): -1365.05 | Epsilon: 0.985 | Time: 0.04s
Episode 4 | Total Reward: -1291.53 | Avg(10): -1346.67 | Epsilon: 0.980 | Time: 0.05s
Episode 5 | Total Reward: -1407.98 | Avg(10): -1358.93 | Epsilon: 0.975 | Time: 0.26s
Episode 6 | Total Reward: -1052.55 | Avg(10): -1307.87 | Epsilon: 0.970 | Time: 7.30s
Episode 7 | Total Reward: -1316.79 | Avg(10): -1309.14 | Epsilon: 0.966 | Time: 7.07s
Episode 8 | Total Reward: -1719.15 | Avg(10): -1360.39 | Epsilon: 0.961 | Time: 7.32s
Episode 9 | Total Reward: -864.93 | Avg(10): -1305.34 | Epsilon: 0.956 | Time: 7.47s
Episode 10 | Total Reward: -1539.21 | Avg(10): -1328.73 | Epsilon: 0.951 | Time: 7.39s
Episode 11 | Total Reward: -951.22 | Avg(10): -1264.92 | Epsilon: 0.946 | Time: 7.53s
Episode 12 | Total Reward: -1163.89 | Avg(10): -1288.61 | Epsilon: 0.942 | Time: 7.46s
Episode 13 | Total Reward: -1737.85 | Avg(10): -1304.51 | Epsilon: 0.937 | Time: 7.42s
Episode 14 | Total Reward: -1168.11 | Avg(10): -1292.17 | Epsilon: 0.932 | Time: 7.34s
Episode 15 | Total Reward: -962.29 | Avg(10): -1247.60 | Epsilon: 0.928 | Time: 7.15s
Episode 16 | Total Reward: -1068.10 | Avg(10): -1249.15 | Epsilon: 0.923 | Time: 7.26s
Episode 17 | Total Reward: -1528.72 | Avg(10): -1270.35 | Epsilon: 0.918 | Time: 7.29s
Episode 18 | Total Reward: -1541.84 | Avg(10): -1252.62 | Epsilon: 0.914 | Time: 7.48s
Episode 19 | Total Reward: -1388.91 | Avg(10): -1305.01 | Epsilon: 0.909 | Time: 7.24s
Episode 20 | Total Reward: -1300.19 | Avg(10): -1281.11 | Epsilon: 0.905 | Time: 10.16s
Episode 21 | Total Reward: -1571.19 | Avg(10): -1343.11 | Epsilon: 0.900 | Time: 6.59s
Episode 22 | Total Reward: -1096.25 | Avg(10): -1336.34 | Epsilon: 0.896 | Time: 6.07s
Episode 23 | Total Reward: -1717.31 | Avg(10): -1334.29 | Epsilon: 0.891 | Time: 6.40s
Episode 24 | Total Reward: -1047.12 | Avg(10): -1322.19 | Epsilon: 0.887 | Time: 6.00s
Episode 25 | Total Reward: -1335.62 | Avg(10): -1359.52 | Epsilon: 0.882 | Time: 6.12s
Episode 26 | Total Reward: -1079.22 | Avg(10): -1360.64 | Epsilon: 0.878 | Time: 5.81s
Episode 27 | Total Reward: -999.25 | Avg(10): -1307.69 | Epsilon: 0.873 | Time: 6.23s
Episode 28 | Total Reward: -923.58 | Avg(10): -1245.87 | Epsilon: 0.869 | Time: 6.00s
Episode 29 | Total Reward: -1437.92 | Avg(10): -1250.77 | Epsilon: 0.865 | Time: 5.91s
Episode 30 | Total Reward: -965.71 | Avg(10): -1217.32 | Epsilon: 0.860 | Time: 6.35s
Episode 31 | Total Reward: -1213.23 | Avg(10): -1181.52 | Epsilon: 0.856 | Time: 5.62s
Episode 32 | Total Reward: -1280.28 | Avg(10): -1199.93 | Epsilon: 0.852 | Time: 5.75s
Episode 33 | Total Reward: -1064.47 | Avg(10): -1134.64 | Epsilon: 0.848 | Time: 7.93s
Episode 34 | Total Reward: -1370.55 | Avg(10): -1166.98 | Epsilon: 0.843 | Time: 6.18s
Episode 35 | Total Reward: -1430.64 | Avg(10): -1176.49 | Epsilon: 0.839 | Time: 7.23s
Episode 36 | Total Reward: -1035.51 | Avg(10): -1172.12 | Epsilon: 0.835 | Time: 6.96s
Episode 37 | Total Reward: -1090.94 | Avg(10): -1181.28 | Epsilon: 0.831 | Time: 7.28s
Episode 38 | Total Reward: -1547.55 | Avg(10): -1243.68 | Epsilon: 0.827 | Time: 6.68s
Episode 39 | Total Reward: -985.95 | Avg(10): -1198.48 | Epsilon: 0.822 | Time: 7.23s
Episode 40 | Total Reward: -1598.59 | Avg(10): -1261.77 | Epsilon: 0.818 | Time: 9.50s
Episode 41 | Total Reward: -1135.59 | Avg(10): -1254.01 | Epsilon: 0.814 | Time: 6.35s
Episode 42 | Total Reward: -1462.46 | Avg(10): -1272.23 | Epsilon: 0.810 | Time: 5.39s
Episode 43 | Total Reward: -1179.69 | Avg(10): -1283.75 | Epsilon: 0.806 | Time: 5.66s
Episode 44 | Total Reward: -1302.68 | Avg(10): -1276.96 | Epsilon: 0.802 | Time: 5.71s
Episode 45 | Total Reward: -1582.33 | Avg(10): -1292.13 | Epsilon: 0.798 | Time: 6.04s
Episode 46 | Total Reward: -1464.31 | Avg(10): -1335.01 | Epsilon: 0.794 | Time: 5.96s
Episode 47 | Total Reward: -1201.03 | Avg(10): -1346.02 | Epsilon: 0.790 | Time: 6.07s
Episode 48 | Total Reward: -973.27 | Avg(10): -1288.59 | Epsilon: 0.786 | Time: 5.84s
Episode 49 | Total Reward: -874.03 | Avg(10): -1277.40 | Epsilon: 0.782 | Time: 6.41s
Episode 50 | Total Reward: -1167.51 | Avg(10): -1234.29 | Epsilon: 0.778 | Time: 6.59s
Episode 51 | Total Reward: -1120.31 | Avg(10): -1232.76 | Epsilon: 0.774 | Time: 6.61s
Episode 52 | Total Reward: -870.76 | Avg(10): -1173.59 | Epsilon: 0.771 | Time: 6.85s
Episode 53 | Total Reward: -758.36 | Avg(10): -1131.46 | Epsilon: 0.767 | Time: 5.99s
Episode 54 | Total Reward: -1435.43 | Avg(10): -1144.73 | Epsilon: 0.763 | Time: 6.17s
Episode 55 | Total Reward: -1048.65 | Avg(10): -1091.37 | Epsilon: 0.759 | Time: 6.25s
Episode 56 | Total Reward: -755.39 | Avg(10): -1020.47 | Epsilon: 0.755 | Time: 7.33s
Episode 57 | Total Reward: -911.69 | Avg(10): -991.54 | Epsilon: 0.751 | Time: 6.75s
Episode 58 | Total Reward: -1107.93 | Avg(10): -1005.01 | Epsilon: 0.748 | Time: 6.73s
Episode 59 | Total Reward: -1215.78 | Avg(10): -1039.18 | Epsilon: 0.744 | Time: 6.55s
Episode 60 | Total Reward: -1269.58 | Avg(10): -1049.39 | Epsilon: 0.740 | Time: 10.48s
Episode 61 | Total Reward: -1107.06 | Avg(10): -1048.06 | Epsilon: 0.737 | Time: 7.50s
Episode 62 | Total Reward: -1358.86 | Avg(10): -1096.87 | Epsilon: 0.733 | Time: 6.91s
Episode 63 | Total Reward: -981.40 | Avg(10): -1119.18 | Epsilon: 0.729 | Time: 6.61s
Episode 64 | Total Reward: -1313.63 | Avg(10): -1107.00 | Epsilon: 0.726 | Time: 6.58s
Episode 65 | Total Reward: -1163.72 | Avg(10): -1118.50 | Epsilon: 0.722 | Time: 6.47s
Episode 66 | Total Reward: -1037.72 | Avg(10): -1146.74 | Epsilon: 0.718 | Time: 6.31s
Episode 67 | Total Reward: -986.91 | Avg(10): -1154.26 | Epsilon: 0.715 | Time: 6.19s
Episode 68 | Total Reward: -1178.54 | Avg(10): -1161.32 | Epsilon: 0.711 | Time: 6.28s
Episode 69 | Total Reward: -1008.27 | Avg(10): -1140.57 | Epsilon: 0.708 | Time: 6.23s
Episode 70 | Total Reward: -1127.64 | Avg(10): -1126.38 | Epsilon: 0.704 | Time: 5.90s
Episode 71 | Total Reward: -1031.86 | Avg(10): -1118.86 | Epsilon: 0.701 | Time: 6.22s
Episode 72 | Total Reward: -851.78 | Avg(10): -1068.15 | Epsilon: 0.697 | Time: 5.86s
Episode 73 | Total Reward: -964.26 | Avg(10): -1066.43 | Epsilon: 0.694 | Time: 6.14s
Episode 74 | Total Reward: -1141.14 | Avg(10): -1049.19 | Epsilon: 0.690 | Time: 5.82s
Episode 75 | Total Reward: -847.69 | Avg(10): -1017.58 | Epsilon: 0.687 | Time: 6.25s
Episode 76 | Total Reward: -1183.32 | Avg(10): -1032.14 | Epsilon: 0.683 | Time: 5.95s
Episode 77 | Total Reward: -1118.76 | Avg(10): -1045.33 | Epsilon: 0.680 | Time: 6.06s
Episode 78 | Total Reward: -1243.02 | Avg(10): -1051.77 | Epsilon: 0.676 | Time: 6.12s
Episode 79 | Total Reward: -1140.67 | Avg(10): -1065.01 | Epsilon: 0.673 | Time: 4.11s
Episode 80 | Total Reward: -1128.06 | Avg(10): -1065.06 | Epsilon: 0.670 | Time: 9.52s
Episode 81 | Total Reward: -1017.69 | Avg(10): -1063.64 | Epsilon: 0.666 | Time: 6.79s
Episode 82 | Total Reward: -1196.33 | Avg(10): -1098.09 | Epsilon: 0.663 | Time: 7.41s
Episode 83 | Total Reward: -1157.73 | Avg(10): -1117.44 | Epsilon: 0.660 | Time: 7.68s
Episode 84 | Total Reward: -881.83 | Avg(10): -1091.51 | Epsilon: 0.656 | Time: 6.63s
Episode 85 | Total Reward: -1132.29 | Avg(10): -1119.97 | Epsilon: 0.653 | Time: 7.11s
Episode 86 | Total Reward: -768.72 | Avg(10): -1078.51 | Epsilon: 0.650 | Time: 7.12s
Episode 87 | Total Reward: -1025.97 | Avg(10): -1069.23 | Epsilon: 0.647 | Time: 6.95s
Episode 88 | Total Reward: -1045.15 | Avg(10): -1049.44 | Epsilon: 0.643 | Time: 7.49s
Episode 89 | Total Reward: -903.80 | Avg(10): -1025.76 | Epsilon: 0.640 | Time: 7.15s
Episode 90 | Total Reward: -1130.77 | Avg(10): -1026.03 | Epsilon: 0.637 | Time: 7.16s
Episode 91 | Total Reward: -1103.63 | Avg(10): -1034.62 | Epsilon: 0.634 | Time: 6.79s
Episode 92 | Total Reward: -885.53 | Avg(10): -1003.54 | Epsilon: 0.631 | Time: 6.88s
Episode 93 | Total Reward: -993.94 | Avg(10): -987.16 | Epsilon: 0.627 | Time: 7.08s
Episode 94 | Total Reward: -893.37 | Avg(10): -988.32 | Epsilon: 0.624 | Time: 6.42s
Episode 95 | Total Reward: -1011.48 | Avg(10): -976.24 | Epsilon: 0.621 | Time: 6.53s
Episode 96 | Total Reward: -901.23 | Avg(10): -989.49 | Epsilon: 0.618 | Time: 6.46s
Episode 97 | Total Reward: -1088.49 | Avg(10): -995.74 | Epsilon: 0.615 | Time: 6.51s
Episode 98 | Total Reward: -641.20 | Avg(10): -955.34 | Epsilon: 0.612 | Time: 6.43s
Episode 99 | Total Reward: -747.77 | Avg(10): -939.74 | Epsilon: 0.609 | Time: 7.05s
Episode 100 | Total Reward: -864.88 | Avg(10): -913.15 | Epsilon: 0.606 | Time: 9.31s
Episode 101 | Total Reward: -772.94 | Avg(10): -880.08 | Epsilon: 0.603 | Time: 6.63s
Episode 102 | Total Reward: -1013.65 | Avg(10): -892.90 | Epsilon: 0.600 | Time: 7.20s
Episode 103 | Total Reward: -509.11 | Avg(10): -844.41 | Epsilon: 0.597 | Time: 7.11s
Episode 104 | Total Reward: -615.93 | Avg(10): -816.67 | Epsilon: 0.594 | Time: 7.12s
Episode 105 | Total Reward: -588.57 | Avg(10): -774.38 | Epsilon: 0.591 | Time: 6.67s
Episode 106 | Total Reward: -871.67 | Avg(10): -771.42 | Epsilon: 0.588 | Time: 6.69s
Episode 107 | Total Reward: -381.20 | Avg(10): -700.69 | Epsilon: 0.585 | Time: 6.29s
Episode 108 | Total Reward: -972.60 | Avg(10): -733.83 | Epsilon: 0.582 | Time: 6.32s
Episode 109 | Total Reward: -1000.44 | Avg(10): -759.10 | Epsilon: 0.579 | Time: 6.78s
Episode 110 | Total Reward: -772.36 | Avg(10): -749.85 | Epsilon: 0.576 | Time: 6.00s
Episode 111 | Total Reward: -879.42 | Avg(10): -760.49 | Epsilon: 0.573 | Time: 5.95s
Episode 112 | Total Reward: -752.63 | Avg(10): -734.39 | Epsilon: 0.570 | Time: 6.15s
Episode 113 | Total Reward: -842.52 | Avg(10): -767.73 | Epsilon: 0.568 | Time: 5.98s
Episode 114 | Total Reward: -1041.18 | Avg(10): -810.26 | Epsilon: 0.565 | Time: 6.32s
Episode 115 | Total Reward: -878.47 | Avg(10): -839.25 | Epsilon: 0.562 | Time: 6.38s
Episode 116 | Total Reward: -633.00 | Avg(10): -815.38 | Epsilon: 0.559 | Time: 6.65s
Episode 117 | Total Reward: -722.40 | Avg(10): -849.50 | Epsilon: 0.556 | Time: 6.37s
Episode 118 | Total Reward: -628.24 | Avg(10): -815.07 | Epsilon: 0.554 | Time: 6.33s
Episode 119 | Total Reward: -966.83 | Avg(10): -811.71 | Epsilon: 0.551 | Time: 6.01s
Episode 120 | Total Reward: -813.67 | Avg(10): -815.84 | Epsilon: 0.548 | Time: 9.18s
Episode 121 | Total Reward: -931.88 | Avg(10): -821.08 | Epsilon: 0.545 | Time: 6.56s
Episode 122 | Total Reward: -639.46 | Avg(10): -809.77 | Epsilon: 0.543 | Time: 6.34s
Episode 123 | Total Reward: -934.82 | Avg(10): -819.00 | Epsilon: 0.540 | Time: 3.76s
Episode 124 | Total Reward: -494.58 | Avg(10): -764.34 | Epsilon: 0.537 | Time: 2.88s
Episode 125 | Total Reward: -918.75 | Avg(10): -768.36 | Epsilon: 0.534 | Time: 3.47s
Episode 126 | Total Reward: -715.84 | Avg(10): -776.65 | Epsilon: 0.532 | Time: 2.94s
Episode 127 | Total Reward: -754.57 | Avg(10): -779.86 | Epsilon: 0.529 | Time: 3.00s
Episode 128 | Total Reward: -497.68 | Avg(10): -766.81 | Epsilon: 0.526 | Time: 3.10s
Episode 129 | Total Reward: -607.75 | Avg(10): -730.90 | Epsilon: 0.524 | Time: 2.94s
Episode 130 | Total Reward: -608.95 | Avg(10): -710.43 | Epsilon: 0.521 | Time: 3.23s
Episode 131 | Total Reward: -617.55 | Avg(10): -679.00 | Epsilon: 0.519 | Time: 3.10s
Episode 132 | Total Reward: -926.39 | Avg(10): -707.69 | Epsilon: 0.516 | Time: 3.02s
Episode 133 | Total Reward: -970.22 | Avg(10): -711.23 | Epsilon: 0.513 | Time: 3.05s
Episode 134 | Total Reward: -599.45 | Avg(10): -721.71 | Epsilon: 0.511 | Time: 3.16s
Episode 135 | Total Reward: -389.72 | Avg(10): -668.81 | Epsilon: 0.508 | Time: 2.99s
Episode 136 | Total Reward: -483.83 | Avg(10): -645.61 | Epsilon: 0.506 | Time: 2.59s
Episode 137 | Total Reward: -644.53 | Avg(10): -634.61 | Epsilon: 0.503 | Time: 2.70s
Episode 138 | Total Reward: -364.20 | Avg(10): -621.26 | Epsilon: 0.501 | Time: 2.86s
Episode 139 | Total Reward: -509.58 | Avg(10): -611.44 | Epsilon: 0.498 | Time: 3.03s
Episode 140 | Total Reward: -506.60 | Avg(10): -601.21 | Epsilon: 0.496 | Time: 4.22s
Episode 141 | Total Reward: -647.71 | Avg(10): -604.22 | Epsilon: 0.493 | Time: 2.66s
Episode 142 | Total Reward: -507.46 | Avg(10): -562.33 | Epsilon: 0.491 | Time: 2.64s
Episode 143 | Total Reward: -700.12 | Avg(10): -535.32 | Epsilon: 0.488 | Time: 2.86s
Episode 144 | Total Reward: -758.75 | Avg(10): -551.25 | Epsilon: 0.486 | Time: 2.58s
Episode 145 | Total Reward: -614.86 | Avg(10): -573.76 | Epsilon: 0.483 | Time: 2.69s
Episode 146 | Total Reward: -280.02 | Avg(10): -553.38 | Epsilon: 0.481 | Time: 2.51s
Episode 147 | Total Reward: -491.35 | Avg(10): -538.07 | Epsilon: 0.479 | Time: 2.63s
Episode 148 | Total Reward: -376.58 | Avg(10): -539.30 | Epsilon: 0.476 | Time: 2.63s
Episode 149 | Total Reward: -904.39 | Avg(10): -578.78 | Epsilon: 0.474 | Time: 2.56s
Episode 150 | Total Reward: -677.33 | Avg(10): -595.86 | Epsilon: 0.471 | Time: 2.63s
Episode 151 | Total Reward: -412.20 | Avg(10): -572.30 | Epsilon: 0.469 | Time: 2.75s
Episode 152 | Total Reward: -253.94 | Avg(10): -546.95 | Epsilon: 0.467 | Time: 2.78s
Episode 153 | Total Reward: -752.86 | Avg(10): -552.23 | Epsilon: 0.464 | Time: 2.61s
Episode 154 | Total Reward: -488.08 | Avg(10): -525.16 | Epsilon: 0.462 | Time: 2.78s
Episode 155 | Total Reward: -380.19 | Avg(10): -501.69 | Epsilon: 0.460 | Time: 2.69s
Episode 156 | Total Reward: -747.57 | Avg(10): -548.45 | Epsilon: 0.458 | Time: 2.78s
Episode 157 | Total Reward: -487.45 | Avg(10): -548.06 | Epsilon: 0.455 | Time: 2.89s
Episode 158 | Total Reward: -500.66 | Avg(10): -560.47 | Epsilon: 0.453 | Time: 2.62s
Episode 159 | Total Reward: -556.02 | Avg(10): -525.63 | Epsilon: 0.451 | Time: 2.36s
Episode 160 | Total Reward: -376.17 | Avg(10): -495.51 | Epsilon: 0.448 | Time: 3.58s
Episode 161 | Total Reward: -379.70 | Avg(10): -492.26 | Epsilon: 0.446 | Time: 2.29s
Episode 162 | Total Reward: -606.56 | Avg(10): -527.53 | Epsilon: 0.444 | Time: 2.29s
Episode 163 | Total Reward: -488.00 | Avg(10): -501.04 | Epsilon: 0.442 | Time: 2.52s
Episode 164 | Total Reward: -255.80 | Avg(10): -477.81 | Epsilon: 0.440 | Time: 2.58s
Episode 165 | Total Reward: -512.74 | Avg(10): -491.07 | Epsilon: 0.437 | Time: 2.49s
Episode 166 | Total Reward: -359.67 | Avg(10): -452.28 | Epsilon: 0.435 | Time: 2.48s
Episode 167 | Total Reward: -505.08 | Avg(10): -454.04 | Epsilon: 0.433 | Time: 2.44s
Episode 168 | Total Reward: -490.60 | Avg(10): -453.03 | Epsilon: 0.431 | Time: 2.56s
Episode 169 | Total Reward: -252.20 | Avg(10): -422.65 | Epsilon: 0.429 | Time: 2.53s
Episode 170 | Total Reward: -363.64 | Avg(10): -421.40 | Epsilon: 0.427 | Time: 2.38s
Episode 171 | Total Reward: -127.05 | Avg(10): -396.13 | Epsilon: 0.424 | Time: 2.47s
Episode 172 | Total Reward: -492.50 | Avg(10): -384.73 | Epsilon: 0.422 | Time: 2.47s
Episode 173 | Total Reward: -495.89 | Avg(10): -385.52 | Epsilon: 0.420 | Time: 2.53s
Episode 174 | Total Reward: -253.39 | Avg(10): -385.28 | Epsilon: 0.418 | Time: 2.56s
Episode 175 | Total Reward: -499.37 | Avg(10): -383.94 | Epsilon: 0.416 | Time: 2.61s
Episode 176 | Total Reward: -281.99 | Avg(10): -376.17 | Epsilon: 0.414 | Time: 2.87s
Episode 177 | Total Reward: -503.31 | Avg(10): -376.00 | Epsilon: 0.412 | Time: 2.91s
Episode 178 | Total Reward: -376.13 | Avg(10): -364.55 | Epsilon: 0.410 | Time: 2.85s
Episode 179 | Total Reward: -242.30 | Avg(10): -363.56 | Epsilon: 0.408 | Time: 2.90s
Episode 180 | Total Reward: -478.83 | Avg(10): -375.08 | Epsilon: 0.406 | Time: 4.32s
Episode 181 | Total Reward: -471.53 | Avg(10): -409.52 | Epsilon: 0.404 | Time: 2.72s
Episode 182 | Total Reward: -401.05 | Avg(10): -400.38 | Epsilon: 0.402 | Time: 2.72s
Episode 183 | Total Reward: -250.37 | Avg(10): -375.83 | Epsilon: 0.400 | Time: 3.03s
Episode 184 | Total Reward: -252.24 | Avg(10): -375.71 | Epsilon: 0.398 | Time: 3.05s
Episode 185 | Total Reward: -362.14 | Avg(10): -361.99 | Epsilon: 0.396 | Time: 2.76s
Episode 186 | Total Reward: -377.28 | Avg(10): -371.52 | Epsilon: 0.394 | Time: 3.12s
Episode 187 | Total Reward: -607.04 | Avg(10): -381.89 | Epsilon: 0.392 | Time: 2.93s
Episode 188 | Total Reward: -243.20 | Avg(10): -368.60 | Epsilon: 0.390 | Time: 2.92s
Episode 189 | Total Reward: -488.91 | Avg(10): -393.26 | Epsilon: 0.388 | Time: 2.87s
Episode 190 | Total Reward: -125.63 | Avg(10): -357.94 | Epsilon: 0.386 | Time: 2.99s
Episode 191 | Total Reward: -128.26 | Avg(10): -323.61 | Epsilon: 0.384 | Time: 2.70s
Episode 192 | Total Reward: -619.87 | Avg(10): -345.49 | Epsilon: 0.382 | Time: 2.45s
Episode 193 | Total Reward: -243.37 | Avg(10): -344.79 | Epsilon: 0.380 | Time: 2.44s
Episode 194 | Total Reward: -252.59 | Avg(10): -344.83 | Epsilon: 0.378 | Time: 2.47s
Episode 195 | Total Reward: -246.37 | Avg(10): -333.25 | Epsilon: 0.376 | Time: 2.42s
Episode 196 | Total Reward: -124.44 | Avg(10): -307.97 | Epsilon: 0.374 | Time: 2.48s
Episode 197 | Total Reward: -367.61 | Avg(10): -284.02 | Epsilon: 0.373 | Time: 2.51s
Episode 198 | Total Reward: -381.20 | Avg(10): -297.83 | Epsilon: 0.371 | Time: 2.76s
Episode 199 | Total Reward: -128.23 | Avg(10): -261.76 | Epsilon: 0.369 | Time: 2.91s
Episode 200 | Total Reward: -602.97 | Avg(10): -309.49 | Epsilon: 0.367 | Time: 4.09s
Best average reward over 10 episodes: -261.76
Best model weights saved to: dqn_pendulum_5actions_weights.h5
Total training time: 1009.47s
Test Episode 1: Total Reward = -260.45
Test Episode 2: Total Reward = -246.17
Test Episode 3: Total Reward = -242.18
Test Episode 4: Total Reward = -371.22
Test Episode 5: Total Reward = -508.48
Test Episode 6: Total Reward = -252.90
Test Episode 7: Total Reward = -250.34
Test Episode 8: Total Reward = -248.45
Test Episode 9: Total Reward = -247.91
Test Episode 10: Total Reward = -367.01

Average Reward over 10 episodes: -299.51 ± 84.19
Saved best episode GIF to dqn_pendulum_5actions_eval_best.gif
Saved worst episode GIF to dqn_pendulum_5actions_eval_worst.gif
Saved average episode GIF to dqn_pendulum_5actions_eval_average.gif
============================================================
============================================================
Running experiment with N_ACTIONS = 11

Model Summary:
Model: "dqn_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_6 (Dense)             multiple                  256       
                                                                 
 dense_7 (Dense)             multiple                  4160      
                                                                 
 dense_8 (Dense)             multiple                  715       
                                                                 
=================================================================
Total params: 5131 (20.04 KB)
Trainable params: 5131 (20.04 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Episode 1 | Total Reward: -1194.38 | Avg(10): -1194.38 | Epsilon: 0.995 | Time: 0.02s
Episode 2 | Total Reward: -1765.43 | Avg(10): -1479.91 | Epsilon: 0.990 | Time: 0.06s
Episode 3 | Total Reward: -996.23 | Avg(10): -1318.68 | Epsilon: 0.985 | Time: 0.04s
Episode 4 | Total Reward: -1506.81 | Avg(10): -1365.71 | Epsilon: 0.980 | Time: 0.03s
Episode 5 | Total Reward: -1198.77 | Avg(10): -1332.32 | Epsilon: 0.975 | Time: 0.16s
Episode 6 | Total Reward: -1636.11 | Avg(10): -1382.95 | Epsilon: 0.970 | Time: 8.05s
Episode 7 | Total Reward: -878.74 | Avg(10): -1310.92 | Epsilon: 0.966 | Time: 6.25s
Episode 8 | Total Reward: -870.16 | Avg(10): -1255.83 | Epsilon: 0.961 | Time: 11.45s
Episode 9 | Total Reward: -1594.76 | Avg(10): -1293.49 | Epsilon: 0.956 | Time: 7.30s
Episode 10 | Total Reward: -1035.19 | Avg(10): -1267.66 | Epsilon: 0.951 | Time: 7.33s
Episode 11 | Total Reward: -991.84 | Avg(10): -1247.40 | Epsilon: 0.946 | Time: 12.99s
Episode 12 | Total Reward: -1159.30 | Avg(10): -1186.79 | Epsilon: 0.942 | Time: 10.31s
Episode 13 | Total Reward: -1540.15 | Avg(10): -1241.18 | Epsilon: 0.937 | Time: 9.71s
Episode 14 | Total Reward: -1373.46 | Avg(10): -1227.85 | Epsilon: 0.932 | Time: 11.10s
Episode 15 | Total Reward: -1630.50 | Avg(10): -1271.02 | Epsilon: 0.928 | Time: 9.61s
Episode 16 | Total Reward: -1504.62 | Avg(10): -1257.87 | Epsilon: 0.923 | Time: 5.17s
Episode 17 | Total Reward: -1434.75 | Avg(10): -1313.47 | Epsilon: 0.918 | Time: 5.47s
Episode 18 | Total Reward: -953.42 | Avg(10): -1321.80 | Epsilon: 0.914 | Time: 5.42s
Episode 19 | Total Reward: -1149.61 | Avg(10): -1277.28 | Epsilon: 0.909 | Time: 6.30s
Episode 20 | Total Reward: -1222.15 | Avg(10): -1295.98 | Epsilon: 0.905 | Time: 6.40s
Episode 21 | Total Reward: -1497.13 | Avg(10): -1346.51 | Epsilon: 0.900 | Time: 2.36s
Episode 22 | Total Reward: -1470.51 | Avg(10): -1377.63 | Epsilon: 0.896 | Time: 2.25s
Episode 23 | Total Reward: -1625.98 | Avg(10): -1386.21 | Epsilon: 0.891 | Time: 2.31s
Episode 24 | Total Reward: -1689.35 | Avg(10): -1417.80 | Epsilon: 0.887 | Time: 2.26s
Episode 25 | Total Reward: -1591.05 | Avg(10): -1413.86 | Epsilon: 0.882 | Time: 2.27s
Episode 26 | Total Reward: -1188.23 | Avg(10): -1382.22 | Epsilon: 0.878 | Time: 2.27s
Episode 27 | Total Reward: -1192.96 | Avg(10): -1358.04 | Epsilon: 0.873 | Time: 2.26s
Episode 28 | Total Reward: -860.09 | Avg(10): -1348.71 | Epsilon: 0.869 | Time: 2.22s
Episode 29 | Total Reward: -1392.15 | Avg(10): -1372.96 | Epsilon: 0.865 | Time: 2.26s
Episode 30 | Total Reward: -1695.56 | Avg(10): -1420.30 | Epsilon: 0.860 | Time: 2.29s
Episode 31 | Total Reward: -1705.25 | Avg(10): -1441.11 | Epsilon: 0.856 | Time: 2.24s
Episode 32 | Total Reward: -1338.20 | Avg(10): -1427.88 | Epsilon: 0.852 | Time: 2.30s
Episode 33 | Total Reward: -1650.03 | Avg(10): -1430.29 | Epsilon: 0.848 | Time: 2.30s
Episode 34 | Total Reward: -1317.37 | Avg(10): -1393.09 | Epsilon: 0.843 | Time: 2.29s
Episode 35 | Total Reward: -1284.49 | Avg(10): -1362.43 | Epsilon: 0.839 | Time: 2.22s
Episode 36 | Total Reward: -1527.78 | Avg(10): -1396.39 | Epsilon: 0.835 | Time: 2.25s
Episode 37 | Total Reward: -1344.17 | Avg(10): -1411.51 | Epsilon: 0.831 | Time: 2.28s
Episode 38 | Total Reward: -1293.11 | Avg(10): -1454.81 | Epsilon: 0.827 | Time: 2.15s
Episode 39 | Total Reward: -876.33 | Avg(10): -1403.23 | Epsilon: 0.822 | Time: 2.26s
Episode 40 | Total Reward: -1251.06 | Avg(10): -1358.78 | Epsilon: 0.818 | Time: 3.64s
Episode 41 | Total Reward: -1477.58 | Avg(10): -1336.01 | Epsilon: 0.814 | Time: 2.25s
Episode 42 | Total Reward: -866.61 | Avg(10): -1288.85 | Epsilon: 0.810 | Time: 2.21s
Episode 43 | Total Reward: -1087.93 | Avg(10): -1232.64 | Epsilon: 0.806 | Time: 2.20s
Episode 44 | Total Reward: -1505.75 | Avg(10): -1251.48 | Epsilon: 0.802 | Time: 2.34s
Episode 45 | Total Reward: -1208.18 | Avg(10): -1243.85 | Epsilon: 0.798 | Time: 2.20s
Episode 46 | Total Reward: -1079.56 | Avg(10): -1199.03 | Epsilon: 0.794 | Time: 2.27s
Episode 47 | Total Reward: -951.24 | Avg(10): -1159.74 | Epsilon: 0.790 | Time: 2.22s
Episode 48 | Total Reward: -890.28 | Avg(10): -1119.45 | Epsilon: 0.786 | Time: 2.24s
Episode 49 | Total Reward: -1366.11 | Avg(10): -1168.43 | Epsilon: 0.782 | Time: 2.29s
Episode 50 | Total Reward: -953.89 | Avg(10): -1138.71 | Epsilon: 0.778 | Time: 2.25s
Episode 51 | Total Reward: -890.94 | Avg(10): -1080.05 | Epsilon: 0.774 | Time: 2.25s
Episode 52 | Total Reward: -871.41 | Avg(10): -1080.53 | Epsilon: 0.771 | Time: 2.26s
Episode 53 | Total Reward: -1284.21 | Avg(10): -1100.16 | Epsilon: 0.767 | Time: 2.43s
Episode 54 | Total Reward: -991.56 | Avg(10): -1048.74 | Epsilon: 0.763 | Time: 2.36s
Episode 55 | Total Reward: -1204.18 | Avg(10): -1048.34 | Epsilon: 0.759 | Time: 2.31s
Episode 56 | Total Reward: -914.21 | Avg(10): -1031.80 | Epsilon: 0.755 | Time: 2.29s
Episode 57 | Total Reward: -1066.01 | Avg(10): -1043.28 | Epsilon: 0.751 | Time: 2.28s
Episode 58 | Total Reward: -1073.48 | Avg(10): -1061.60 | Epsilon: 0.748 | Time: 2.31s
Episode 59 | Total Reward: -868.06 | Avg(10): -1011.80 | Epsilon: 0.744 | Time: 3.37s
Episode 60 | Total Reward: -865.54 | Avg(10): -1002.96 | Epsilon: 0.740 | Time: 5.42s
Episode 61 | Total Reward: -901.30 | Avg(10): -1004.00 | Epsilon: 0.737 | Time: 2.58s
Episode 62 | Total Reward: -869.15 | Avg(10): -1003.77 | Epsilon: 0.733 | Time: 2.31s
Episode 63 | Total Reward: -1010.20 | Avg(10): -976.37 | Epsilon: 0.729 | Time: 2.20s
Episode 64 | Total Reward: -875.17 | Avg(10): -964.73 | Epsilon: 0.726 | Time: 2.21s
Episode 65 | Total Reward: -731.43 | Avg(10): -917.45 | Epsilon: 0.722 | Time: 2.26s
Episode 66 | Total Reward: -1077.12 | Avg(10): -933.75 | Epsilon: 0.718 | Time: 2.25s
Episode 67 | Total Reward: -870.96 | Avg(10): -914.24 | Epsilon: 0.715 | Time: 2.36s
Episode 68 | Total Reward: -1310.21 | Avg(10): -937.91 | Epsilon: 0.711 | Time: 2.29s
Episode 69 | Total Reward: -1303.96 | Avg(10): -981.50 | Epsilon: 0.708 | Time: 2.23s
Episode 70 | Total Reward: -855.78 | Avg(10): -980.53 | Epsilon: 0.704 | Time: 2.25s
Episode 71 | Total Reward: -635.77 | Avg(10): -953.97 | Epsilon: 0.701 | Time: 2.38s
Episode 72 | Total Reward: -971.98 | Avg(10): -964.26 | Epsilon: 0.697 | Time: 2.24s
Episode 73 | Total Reward: -967.97 | Avg(10): -960.03 | Epsilon: 0.694 | Time: 2.22s
Episode 74 | Total Reward: -769.64 | Avg(10): -949.48 | Epsilon: 0.690 | Time: 2.21s
Episode 75 | Total Reward: -761.42 | Avg(10): -952.48 | Epsilon: 0.687 | Time: 2.32s
Episode 76 | Total Reward: -1022.89 | Avg(10): -947.06 | Epsilon: 0.683 | Time: 2.26s
Episode 77 | Total Reward: -895.05 | Avg(10): -949.47 | Epsilon: 0.680 | Time: 2.23s
Episode 78 | Total Reward: -910.71 | Avg(10): -909.52 | Epsilon: 0.676 | Time: 2.38s
Episode 79 | Total Reward: -1057.19 | Avg(10): -884.84 | Epsilon: 0.673 | Time: 2.29s
Episode 80 | Total Reward: -568.47 | Avg(10): -856.11 | Epsilon: 0.670 | Time: 3.72s
Episode 81 | Total Reward: -1113.52 | Avg(10): -903.88 | Epsilon: 0.666 | Time: 2.28s
Episode 82 | Total Reward: -1151.25 | Avg(10): -921.81 | Epsilon: 0.663 | Time: 2.27s
Episode 83 | Total Reward: -509.55 | Avg(10): -875.97 | Epsilon: 0.660 | Time: 2.43s
Episode 84 | Total Reward: -977.89 | Avg(10): -896.79 | Epsilon: 0.656 | Time: 2.26s
Episode 85 | Total Reward: -1239.06 | Avg(10): -944.56 | Epsilon: 0.653 | Time: 2.30s
Episode 86 | Total Reward: -1016.66 | Avg(10): -943.93 | Epsilon: 0.650 | Time: 2.24s
Episode 87 | Total Reward: -966.86 | Avg(10): -951.12 | Epsilon: 0.647 | Time: 2.25s
Episode 88 | Total Reward: -1062.58 | Avg(10): -966.30 | Epsilon: 0.643 | Time: 2.33s
Episode 89 | Total Reward: -942.99 | Avg(10): -954.88 | Epsilon: 0.640 | Time: 2.29s
Episode 90 | Total Reward: -883.95 | Avg(10): -986.43 | Epsilon: 0.637 | Time: 2.50s
Episode 91 | Total Reward: -1131.10 | Avg(10): -988.19 | Epsilon: 0.634 | Time: 2.65s
Episode 92 | Total Reward: -1193.47 | Avg(10): -992.41 | Epsilon: 0.631 | Time: 2.86s
Episode 93 | Total Reward: -1075.36 | Avg(10): -1048.99 | Epsilon: 0.627 | Time: 2.98s
Episode 94 | Total Reward: -1141.64 | Avg(10): -1065.37 | Epsilon: 0.624 | Time: 2.54s
Episode 95 | Total Reward: -1035.19 | Avg(10): -1044.98 | Epsilon: 0.621 | Time: 2.62s
Episode 96 | Total Reward: -1083.68 | Avg(10): -1051.68 | Epsilon: 0.618 | Time: 2.42s
Episode 97 | Total Reward: -1186.46 | Avg(10): -1073.64 | Epsilon: 0.615 | Time: 2.35s
Episode 98 | Total Reward: -1028.04 | Avg(10): -1070.19 | Epsilon: 0.612 | Time: 2.32s
Episode 99 | Total Reward: -963.33 | Avg(10): -1072.22 | Epsilon: 0.609 | Time: 2.33s
Episode 100 | Total Reward: -899.80 | Avg(10): -1073.81 | Epsilon: 0.606 | Time: 3.89s
Episode 101 | Total Reward: -1044.36 | Avg(10): -1065.13 | Epsilon: 0.603 | Time: 2.45s
Episode 102 | Total Reward: -867.83 | Avg(10): -1032.57 | Epsilon: 0.600 | Time: 2.31s
Episode 103 | Total Reward: -876.90 | Avg(10): -1012.72 | Epsilon: 0.597 | Time: 2.42s
Episode 104 | Total Reward: -906.04 | Avg(10): -989.16 | Epsilon: 0.594 | Time: 2.38s
Episode 105 | Total Reward: -1230.67 | Avg(10): -1008.71 | Epsilon: 0.591 | Time: 2.47s
Episode 106 | Total Reward: -663.25 | Avg(10): -966.67 | Epsilon: 0.588 | Time: 2.39s
Episode 107 | Total Reward: -767.69 | Avg(10): -924.79 | Epsilon: 0.585 | Time: 2.37s
Episode 108 | Total Reward: -855.18 | Avg(10): -907.51 | Epsilon: 0.582 | Time: 2.31s
Episode 109 | Total Reward: -750.11 | Avg(10): -886.18 | Epsilon: 0.579 | Time: 2.47s
Episode 110 | Total Reward: -629.60 | Avg(10): -859.16 | Epsilon: 0.576 | Time: 2.31s
Episode 111 | Total Reward: -666.22 | Avg(10): -821.35 | Epsilon: 0.573 | Time: 2.30s
Episode 112 | Total Reward: -1001.02 | Avg(10): -834.67 | Epsilon: 0.570 | Time: 2.36s
Episode 113 | Total Reward: -513.68 | Avg(10): -798.35 | Epsilon: 0.568 | Time: 2.30s
Episode 114 | Total Reward: -612.72 | Avg(10): -769.01 | Epsilon: 0.565 | Time: 2.27s
Episode 115 | Total Reward: -632.60 | Avg(10): -709.21 | Epsilon: 0.562 | Time: 2.31s
Episode 116 | Total Reward: -380.81 | Avg(10): -680.96 | Epsilon: 0.559 | Time: 2.32s
Episode 117 | Total Reward: -380.92 | Avg(10): -642.29 | Epsilon: 0.556 | Time: 2.32s
Episode 118 | Total Reward: -633.91 | Avg(10): -620.16 | Epsilon: 0.554 | Time: 2.35s
Episode 119 | Total Reward: -758.88 | Avg(10): -621.04 | Epsilon: 0.551 | Time: 2.31s
Episode 120 | Total Reward: -1025.04 | Avg(10): -660.58 | Epsilon: 0.548 | Time: 3.59s
Episode 121 | Total Reward: -636.74 | Avg(10): -657.63 | Epsilon: 0.545 | Time: 2.24s
Episode 122 | Total Reward: -634.20 | Avg(10): -620.95 | Epsilon: 0.543 | Time: 2.25s
Episode 123 | Total Reward: -652.90 | Avg(10): -634.87 | Epsilon: 0.540 | Time: 2.25s
Episode 124 | Total Reward: -1084.07 | Avg(10): -682.01 | Epsilon: 0.537 | Time: 2.24s
Episode 125 | Total Reward: -441.31 | Avg(10): -662.88 | Epsilon: 0.534 | Time: 2.37s
Episode 126 | Total Reward: -720.20 | Avg(10): -696.82 | Epsilon: 0.532 | Time: 2.25s
Episode 127 | Total Reward: -772.79 | Avg(10): -736.00 | Epsilon: 0.529 | Time: 2.25s
Episode 128 | Total Reward: -880.73 | Avg(10): -760.69 | Epsilon: 0.526 | Time: 2.24s
Episode 129 | Total Reward: -620.55 | Avg(10): -746.85 | Epsilon: 0.524 | Time: 2.26s
Episode 130 | Total Reward: -622.49 | Avg(10): -706.60 | Epsilon: 0.521 | Time: 2.28s
Episode 131 | Total Reward: -505.62 | Avg(10): -693.49 | Epsilon: 0.519 | Time: 2.27s
Episode 132 | Total Reward: -773.80 | Avg(10): -707.45 | Epsilon: 0.516 | Time: 2.27s
Episode 133 | Total Reward: -740.46 | Avg(10): -716.20 | Epsilon: 0.513 | Time: 2.31s
Episode 134 | Total Reward: -489.94 | Avg(10): -656.79 | Epsilon: 0.511 | Time: 2.31s
Episode 135 | Total Reward: -714.16 | Avg(10): -684.07 | Epsilon: 0.508 | Time: 2.31s
Episode 136 | Total Reward: -805.08 | Avg(10): -692.56 | Epsilon: 0.506 | Time: 2.25s
Episode 137 | Total Reward: -748.04 | Avg(10): -690.09 | Epsilon: 0.503 | Time: 2.25s
Episode 138 | Total Reward: -505.18 | Avg(10): -652.53 | Epsilon: 0.501 | Time: 2.27s
Episode 139 | Total Reward: -389.21 | Avg(10): -629.40 | Epsilon: 0.498 | Time: 2.23s
Episode 140 | Total Reward: -735.13 | Avg(10): -640.66 | Epsilon: 0.496 | Time: 3.61s
Episode 141 | Total Reward: -745.79 | Avg(10): -664.68 | Epsilon: 0.493 | Time: 2.25s
Episode 142 | Total Reward: -614.64 | Avg(10): -648.76 | Epsilon: 0.491 | Time: 2.38s
Episode 143 | Total Reward: -763.45 | Avg(10): -651.06 | Epsilon: 0.488 | Time: 2.22s
Episode 144 | Total Reward: -1096.81 | Avg(10): -711.75 | Epsilon: 0.486 | Time: 2.28s
Episode 145 | Total Reward: -507.56 | Avg(10): -691.09 | Epsilon: 0.483 | Time: 2.23s
Episode 146 | Total Reward: -107.14 | Avg(10): -621.29 | Epsilon: 0.481 | Time: 2.26s
Episode 147 | Total Reward: -502.74 | Avg(10): -596.77 | Epsilon: 0.479 | Time: 2.27s
Episode 148 | Total Reward: -503.31 | Avg(10): -596.58 | Epsilon: 0.476 | Time: 2.25s
Episode 149 | Total Reward: -523.95 | Avg(10): -610.05 | Epsilon: 0.474 | Time: 2.35s
Episode 150 | Total Reward: -245.52 | Avg(10): -561.09 | Epsilon: 0.471 | Time: 2.32s
Episode 151 | Total Reward: -374.13 | Avg(10): -523.93 | Epsilon: 0.469 | Time: 2.27s
Episode 152 | Total Reward: -251.71 | Avg(10): -487.63 | Epsilon: 0.467 | Time: 2.35s
Episode 153 | Total Reward: -260.40 | Avg(10): -437.33 | Epsilon: 0.464 | Time: 2.30s
Episode 154 | Total Reward: -873.52 | Avg(10): -415.00 | Epsilon: 0.462 | Time: 2.27s
Episode 155 | Total Reward: -615.54 | Avg(10): -425.80 | Epsilon: 0.460 | Time: 2.28s
Episode 156 | Total Reward: -370.79 | Avg(10): -452.16 | Epsilon: 0.458 | Time: 2.31s
Episode 157 | Total Reward: -621.73 | Avg(10): -464.06 | Epsilon: 0.455 | Time: 2.25s
Episode 158 | Total Reward: -616.34 | Avg(10): -475.36 | Epsilon: 0.453 | Time: 2.37s
Episode 159 | Total Reward: -246.92 | Avg(10): -447.66 | Epsilon: 0.451 | Time: 2.31s
Episode 160 | Total Reward: -613.73 | Avg(10): -484.48 | Epsilon: 0.448 | Time: 3.59s
Episode 161 | Total Reward: -789.55 | Avg(10): -526.02 | Epsilon: 0.446 | Time: 2.20s
Episode 162 | Total Reward: -248.77 | Avg(10): -525.73 | Epsilon: 0.444 | Time: 2.28s
Episode 163 | Total Reward: -495.81 | Avg(10): -549.27 | Epsilon: 0.442 | Time: 2.29s
Episode 164 | Total Reward: -490.31 | Avg(10): -510.95 | Epsilon: 0.440 | Time: 2.24s
Episode 165 | Total Reward: -505.74 | Avg(10): -499.97 | Epsilon: 0.437 | Time: 2.31s
Episode 166 | Total Reward: -615.80 | Avg(10): -524.47 | Epsilon: 0.435 | Time: 2.29s
Episode 167 | Total Reward: -169.65 | Avg(10): -479.26 | Epsilon: 0.433 | Time: 2.26s
Episode 168 | Total Reward: -2.59 | Avg(10): -417.89 | Epsilon: 0.431 | Time: 2.35s
Episode 169 | Total Reward: -755.66 | Avg(10): -468.76 | Epsilon: 0.429 | Time: 2.27s
Episode 170 | Total Reward: -662.50 | Avg(10): -473.64 | Epsilon: 0.427 | Time: 2.27s
Episode 171 | Total Reward: -254.03 | Avg(10): -420.08 | Epsilon: 0.424 | Time: 2.32s
Episode 172 | Total Reward: -622.45 | Avg(10): -457.45 | Epsilon: 0.422 | Time: 2.30s
Episode 173 | Total Reward: -619.38 | Avg(10): -469.81 | Epsilon: 0.420 | Time: 2.39s
Episode 174 | Total Reward: -618.34 | Avg(10): -482.61 | Epsilon: 0.418 | Time: 2.29s
Episode 175 | Total Reward: -494.25 | Avg(10): -481.46 | Epsilon: 0.416 | Time: 2.34s
Episode 176 | Total Reward: -374.15 | Avg(10): -457.30 | Epsilon: 0.414 | Time: 2.26s
Episode 177 | Total Reward: -481.09 | Avg(10): -488.44 | Epsilon: 0.412 | Time: 2.30s
Episode 178 | Total Reward: -380.70 | Avg(10): -526.25 | Epsilon: 0.410 | Time: 2.35s
Episode 179 | Total Reward: -380.68 | Avg(10): -488.76 | Epsilon: 0.408 | Time: 2.44s
Episode 180 | Total Reward: -386.19 | Avg(10): -461.13 | Epsilon: 0.406 | Time: 3.98s
Episode 181 | Total Reward: -235.44 | Avg(10): -459.27 | Epsilon: 0.404 | Time: 2.36s
Episode 182 | Total Reward: -486.37 | Avg(10): -445.66 | Epsilon: 0.402 | Time: 2.38s
Episode 183 | Total Reward: -738.17 | Avg(10): -457.54 | Epsilon: 0.400 | Time: 2.31s
Episode 184 | Total Reward: -245.96 | Avg(10): -420.30 | Epsilon: 0.398 | Time: 2.26s
Episode 185 | Total Reward: -197.83 | Avg(10): -390.66 | Epsilon: 0.396 | Time: 2.26s
Episode 186 | Total Reward: -617.73 | Avg(10): -415.02 | Epsilon: 0.394 | Time: 2.27s
Episode 187 | Total Reward: -252.08 | Avg(10): -392.12 | Epsilon: 0.392 | Time: 2.29s
Episode 188 | Total Reward: -493.13 | Avg(10): -403.36 | Epsilon: 0.390 | Time: 2.42s
Episode 189 | Total Reward: -367.36 | Avg(10): -402.03 | Epsilon: 0.388 | Time: 2.31s
Episode 190 | Total Reward: -253.57 | Avg(10): -388.77 | Epsilon: 0.386 | Time: 2.35s
Episode 191 | Total Reward: -244.08 | Avg(10): -389.63 | Epsilon: 0.384 | Time: 2.26s
Episode 192 | Total Reward: -516.71 | Avg(10): -392.66 | Epsilon: 0.382 | Time: 2.35s
Episode 193 | Total Reward: -564.98 | Avg(10): -375.34 | Epsilon: 0.380 | Time: 2.39s
Episode 194 | Total Reward: -381.02 | Avg(10): -388.85 | Epsilon: 0.378 | Time: 2.32s
Episode 195 | Total Reward: -640.95 | Avg(10): -433.16 | Epsilon: 0.376 | Time: 2.37s
Episode 196 | Total Reward: -501.31 | Avg(10): -421.52 | Epsilon: 0.374 | Time: 2.39s
Episode 197 | Total Reward: -331.96 | Avg(10): -429.51 | Epsilon: 0.373 | Time: 3.12s
Episode 198 | Total Reward: -130.47 | Avg(10): -393.24 | Epsilon: 0.371 | Time: 3.79s
Episode 199 | Total Reward: -361.73 | Avg(10): -392.68 | Epsilon: 0.369 | Time: 3.72s
Episode 200 | Total Reward: -619.51 | Avg(10): -429.27 | Epsilon: 0.367 | Time: 3.89s
Best average reward over 10 episodes: -375.34
Best model weights saved to: dqn_pendulum_11actions_weights.h5
Total training time: 559.49s
Test Episode 1: Total Reward = -195.27
Test Episode 2: Total Reward = -253.84
Test Episode 3: Total Reward = -374.48
Test Episode 4: Total Reward = -374.40
Test Episode 5: Total Reward = -748.61
Test Episode 6: Total Reward = -129.91
Test Episode 7: Total Reward = -493.26
Test Episode 8: Total Reward = -253.90
Test Episode 9: Total Reward = -611.68
Test Episode 10: Total Reward = -249.00

Average Reward over 10 episodes: -368.44 ± 186.21
Saved best episode GIF to dqn_pendulum_11actions_eval_best.gif
Saved worst episode GIF to dqn_pendulum_11actions_eval_worst.gif
Saved average episode GIF to dqn_pendulum_11actions_eval_average.gif
============================================================
============================================================
Running experiment with N_ACTIONS = 21

Model Summary:
Model: "dqn_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_12 (Dense)            multiple                  256       
                                                                 
 dense_13 (Dense)            multiple                  4160      
                                                                 
 dense_14 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Episode 1 | Total Reward: -1286.70 | Avg(10): -1286.70 | Epsilon: 0.995 | Time: 0.11s
Episode 2 | Total Reward: -964.67 | Avg(10): -1125.69 | Epsilon: 0.990 | Time: 0.08s
Episode 3 | Total Reward: -1281.09 | Avg(10): -1177.49 | Epsilon: 0.985 | Time: 0.05s
Episode 4 | Total Reward: -878.27 | Avg(10): -1102.68 | Epsilon: 0.980 | Time: 0.05s
Episode 5 | Total Reward: -1070.12 | Avg(10): -1096.17 | Epsilon: 0.975 | Time: 0.08s
Episode 6 | Total Reward: -1320.43 | Avg(10): -1133.55 | Epsilon: 0.970 | Time: 4.97s
Episode 7 | Total Reward: -955.58 | Avg(10): -1108.12 | Epsilon: 0.966 | Time: 4.91s
Episode 8 | Total Reward: -1245.01 | Avg(10): -1125.23 | Epsilon: 0.961 | Time: 5.19s
Episode 9 | Total Reward: -1432.08 | Avg(10): -1159.33 | Epsilon: 0.956 | Time: 5.31s
Episode 10 | Total Reward: -1235.38 | Avg(10): -1166.93 | Epsilon: 0.951 | Time: 5.20s
Episode 11 | Total Reward: -1803.54 | Avg(10): -1218.62 | Epsilon: 0.946 | Time: 5.25s
Episode 12 | Total Reward: -1328.66 | Avg(10): -1255.02 | Epsilon: 0.942 | Time: 5.05s
Episode 13 | Total Reward: -1299.79 | Avg(10): -1256.89 | Epsilon: 0.937 | Time: 5.28s
Episode 14 | Total Reward: -1417.75 | Avg(10): -1310.83 | Epsilon: 0.932 | Time: 5.16s
Episode 15 | Total Reward: -1800.36 | Avg(10): -1383.86 | Epsilon: 0.928 | Time: 4.99s
Episode 16 | Total Reward: -1321.06 | Avg(10): -1383.92 | Epsilon: 0.923 | Time: 4.90s
Episode 17 | Total Reward: -1271.43 | Avg(10): -1415.51 | Epsilon: 0.918 | Time: 5.08s
Episode 18 | Total Reward: -1763.12 | Avg(10): -1467.32 | Epsilon: 0.914 | Time: 4.80s
Episode 19 | Total Reward: -978.06 | Avg(10): -1421.92 | Epsilon: 0.909 | Time: 4.86s
Episode 20 | Total Reward: -1677.88 | Avg(10): -1466.16 | Epsilon: 0.905 | Time: 6.07s
Episode 21 | Total Reward: -857.83 | Avg(10): -1371.59 | Epsilon: 0.900 | Time: 2.28s
Episode 22 | Total Reward: -1787.72 | Avg(10): -1417.50 | Epsilon: 0.896 | Time: 2.35s
Episode 23 | Total Reward: -1489.98 | Avg(10): -1436.52 | Epsilon: 0.891 | Time: 2.12s
Episode 24 | Total Reward: -1147.63 | Avg(10): -1409.51 | Epsilon: 0.887 | Time: 2.88s
Episode 25 | Total Reward: -1810.82 | Avg(10): -1410.55 | Epsilon: 0.882 | Time: 2.71s
Episode 26 | Total Reward: -1292.71 | Avg(10): -1407.72 | Epsilon: 0.878 | Time: 2.45s
Episode 27 | Total Reward: -1625.98 | Avg(10): -1443.17 | Epsilon: 0.873 | Time: 2.38s
Episode 28 | Total Reward: -1108.36 | Avg(10): -1377.70 | Epsilon: 0.869 | Time: 2.31s
Episode 29 | Total Reward: -1294.51 | Avg(10): -1409.34 | Epsilon: 0.865 | Time: 2.69s
Episode 30 | Total Reward: -1593.13 | Avg(10): -1400.87 | Epsilon: 0.860 | Time: 2.43s
Episode 31 | Total Reward: -1035.88 | Avg(10): -1418.67 | Epsilon: 0.856 | Time: 2.60s
Episode 32 | Total Reward: -955.98 | Avg(10): -1335.50 | Epsilon: 0.852 | Time: 2.76s
Episode 33 | Total Reward: -1532.93 | Avg(10): -1339.79 | Epsilon: 0.848 | Time: 2.15s
Episode 34 | Total Reward: -1116.00 | Avg(10): -1336.63 | Epsilon: 0.843 | Time: 2.07s
Episode 35 | Total Reward: -1370.26 | Avg(10): -1292.58 | Epsilon: 0.839 | Time: 2.06s
Episode 36 | Total Reward: -964.81 | Avg(10): -1259.79 | Epsilon: 0.835 | Time: 2.05s
Episode 37 | Total Reward: -957.40 | Avg(10): -1192.93 | Epsilon: 0.831 | Time: 2.18s
Episode 38 | Total Reward: -1378.46 | Avg(10): -1219.94 | Epsilon: 0.827 | Time: 2.16s
Episode 39 | Total Reward: -1486.25 | Avg(10): -1239.11 | Epsilon: 0.822 | Time: 2.18s
Episode 40 | Total Reward: -1084.76 | Avg(10): -1188.27 | Epsilon: 0.818 | Time: 3.53s
Episode 41 | Total Reward: -1313.33 | Avg(10): -1216.02 | Epsilon: 0.814 | Time: 2.04s
Episode 42 | Total Reward: -1380.74 | Avg(10): -1258.49 | Epsilon: 0.810 | Time: 2.10s
Episode 43 | Total Reward: -895.93 | Avg(10): -1194.79 | Epsilon: 0.806 | Time: 2.07s
Episode 44 | Total Reward: -800.83 | Avg(10): -1163.28 | Epsilon: 0.802 | Time: 2.52s
Episode 45 | Total Reward: -1416.30 | Avg(10): -1167.88 | Epsilon: 0.798 | Time: 2.42s
Episode 46 | Total Reward: -886.37 | Avg(10): -1160.04 | Epsilon: 0.794 | Time: 2.23s
Episode 47 | Total Reward: -987.34 | Avg(10): -1163.03 | Epsilon: 0.790 | Time: 2.35s
Episode 48 | Total Reward: -864.74 | Avg(10): -1111.66 | Epsilon: 0.786 | Time: 2.41s
Episode 49 | Total Reward: -1297.41 | Avg(10): -1092.77 | Epsilon: 0.782 | Time: 2.42s
Episode 50 | Total Reward: -753.10 | Avg(10): -1059.61 | Epsilon: 0.778 | Time: 2.29s
Episode 51 | Total Reward: -901.71 | Avg(10): -1018.45 | Epsilon: 0.774 | Time: 2.17s
Episode 52 | Total Reward: -984.79 | Avg(10): -978.85 | Epsilon: 0.771 | Time: 2.16s
Episode 53 | Total Reward: -754.64 | Avg(10): -964.72 | Epsilon: 0.767 | Time: 2.14s
Episode 54 | Total Reward: -934.89 | Avg(10): -978.13 | Epsilon: 0.763 | Time: 2.30s
Episode 55 | Total Reward: -967.58 | Avg(10): -933.26 | Epsilon: 0.759 | Time: 2.31s
Episode 56 | Total Reward: -1026.56 | Avg(10): -947.28 | Epsilon: 0.755 | Time: 2.10s
Episode 57 | Total Reward: -1064.58 | Avg(10): -955.00 | Epsilon: 0.751 | Time: 2.10s
Episode 58 | Total Reward: -1305.95 | Avg(10): -999.12 | Epsilon: 0.748 | Time: 2.07s
Episode 59 | Total Reward: -1033.33 | Avg(10): -972.71 | Epsilon: 0.744 | Time: 2.06s
Episode 60 | Total Reward: -969.83 | Avg(10): -994.38 | Epsilon: 0.740 | Time: 3.67s
Episode 61 | Total Reward: -1201.77 | Avg(10): -1024.39 | Epsilon: 0.737 | Time: 2.28s
Episode 62 | Total Reward: -1415.87 | Avg(10): -1067.50 | Epsilon: 0.733 | Time: 2.14s
Episode 63 | Total Reward: -1134.95 | Avg(10): -1105.53 | Epsilon: 0.729 | Time: 2.15s
Episode 64 | Total Reward: -919.91 | Avg(10): -1104.03 | Epsilon: 0.726 | Time: 2.09s
Episode 65 | Total Reward: -1131.08 | Avg(10): -1120.38 | Epsilon: 0.722 | Time: 2.15s
Episode 66 | Total Reward: -874.24 | Avg(10): -1105.15 | Epsilon: 0.718 | Time: 2.04s
Episode 67 | Total Reward: -991.11 | Avg(10): -1097.80 | Epsilon: 0.715 | Time: 2.05s
Episode 68 | Total Reward: -873.47 | Avg(10): -1054.56 | Epsilon: 0.711 | Time: 2.05s
Episode 69 | Total Reward: -980.58 | Avg(10): -1049.28 | Epsilon: 0.708 | Time: 2.18s
Episode 70 | Total Reward: -876.34 | Avg(10): -1039.93 | Epsilon: 0.704 | Time: 2.18s
Episode 71 | Total Reward: -1228.35 | Avg(10): -1042.59 | Epsilon: 0.701 | Time: 2.14s
Episode 72 | Total Reward: -1105.84 | Avg(10): -1011.59 | Epsilon: 0.697 | Time: 2.08s
Episode 73 | Total Reward: -1080.75 | Avg(10): -1006.17 | Epsilon: 0.694 | Time: 2.10s
Episode 74 | Total Reward: -1215.68 | Avg(10): -1035.74 | Epsilon: 0.690 | Time: 2.11s
Episode 75 | Total Reward: -1256.23 | Avg(10): -1048.26 | Epsilon: 0.687 | Time: 2.26s
Episode 76 | Total Reward: -1001.24 | Avg(10): -1060.96 | Epsilon: 0.683 | Time: 2.13s
Episode 77 | Total Reward: -1049.44 | Avg(10): -1066.79 | Epsilon: 0.680 | Time: 2.12s
Episode 78 | Total Reward: -838.22 | Avg(10): -1063.27 | Epsilon: 0.676 | Time: 2.17s
Episode 79 | Total Reward: -1093.97 | Avg(10): -1074.61 | Epsilon: 0.673 | Time: 2.11s
Episode 80 | Total Reward: -1304.49 | Avg(10): -1117.42 | Epsilon: 0.670 | Time: 3.54s
Episode 81 | Total Reward: -908.15 | Avg(10): -1085.40 | Epsilon: 0.666 | Time: 2.10s
Episode 82 | Total Reward: -1039.08 | Avg(10): -1078.72 | Epsilon: 0.663 | Time: 2.10s
Episode 83 | Total Reward: -1051.22 | Avg(10): -1075.77 | Epsilon: 0.660 | Time: 2.09s
Episode 84 | Total Reward: -1045.05 | Avg(10): -1058.71 | Epsilon: 0.656 | Time: 2.09s
Episode 85 | Total Reward: -878.61 | Avg(10): -1020.95 | Epsilon: 0.653 | Time: 2.14s
Episode 86 | Total Reward: -1105.20 | Avg(10): -1031.34 | Epsilon: 0.650 | Time: 2.19s
Episode 87 | Total Reward: -988.64 | Avg(10): -1025.26 | Epsilon: 0.647 | Time: 2.17s
Episode 88 | Total Reward: -952.08 | Avg(10): -1036.65 | Epsilon: 0.643 | Time: 2.11s
Episode 89 | Total Reward: -1110.92 | Avg(10): -1038.34 | Epsilon: 0.640 | Time: 2.14s
Episode 90 | Total Reward: -946.62 | Avg(10): -1002.56 | Epsilon: 0.637 | Time: 2.08s
Episode 91 | Total Reward: -804.10 | Avg(10): -992.15 | Epsilon: 0.634 | Time: 2.06s
Episode 92 | Total Reward: -1162.06 | Avg(10): -1004.45 | Epsilon: 0.631 | Time: 2.12s
Episode 93 | Total Reward: -1032.99 | Avg(10): -1002.63 | Epsilon: 0.627 | Time: 2.11s
Episode 94 | Total Reward: -882.16 | Avg(10): -986.34 | Epsilon: 0.624 | Time: 2.15s
Episode 95 | Total Reward: -1015.26 | Avg(10): -1000.00 | Epsilon: 0.621 | Time: 2.20s
Episode 96 | Total Reward: -1129.39 | Avg(10): -1002.42 | Epsilon: 0.618 | Time: 2.38s
Episode 97 | Total Reward: -982.43 | Avg(10): -1001.80 | Epsilon: 0.615 | Time: 2.25s
Episode 98 | Total Reward: -913.79 | Avg(10): -997.97 | Epsilon: 0.612 | Time: 2.24s
Episode 99 | Total Reward: -740.60 | Avg(10): -960.94 | Epsilon: 0.609 | Time: 2.46s
Episode 100 | Total Reward: -1092.42 | Avg(10): -975.52 | Epsilon: 0.606 | Time: 3.69s
Episode 101 | Total Reward: -911.82 | Avg(10): -986.29 | Epsilon: 0.603 | Time: 2.27s
Episode 102 | Total Reward: -891.27 | Avg(10): -959.21 | Epsilon: 0.600 | Time: 2.17s
Episode 103 | Total Reward: -910.97 | Avg(10): -947.01 | Epsilon: 0.597 | Time: 2.11s
Episode 104 | Total Reward: -999.69 | Avg(10): -958.76 | Epsilon: 0.594 | Time: 2.11s
Episode 105 | Total Reward: -1107.91 | Avg(10): -968.03 | Epsilon: 0.591 | Time: 2.20s
Episode 106 | Total Reward: -990.16 | Avg(10): -954.11 | Epsilon: 0.588 | Time: 2.29s
Episode 107 | Total Reward: -770.82 | Avg(10): -932.94 | Epsilon: 0.585 | Time: 2.27s
Episode 108 | Total Reward: -923.86 | Avg(10): -933.95 | Epsilon: 0.582 | Time: 2.23s
Episode 109 | Total Reward: -959.04 | Avg(10): -955.80 | Epsilon: 0.579 | Time: 2.18s
Episode 110 | Total Reward: -763.70 | Avg(10): -922.92 | Epsilon: 0.576 | Time: 2.15s
Episode 111 | Total Reward: -510.28 | Avg(10): -882.77 | Epsilon: 0.573 | Time: 2.12s
Episode 112 | Total Reward: -610.07 | Avg(10): -854.65 | Epsilon: 0.570 | Time: 2.11s
Episode 113 | Total Reward: -749.74 | Avg(10): -838.53 | Epsilon: 0.568 | Time: 2.43s
Episode 114 | Total Reward: -510.79 | Avg(10): -789.64 | Epsilon: 0.565 | Time: 2.57s
Episode 115 | Total Reward: -619.50 | Avg(10): -740.79 | Epsilon: 0.562 | Time: 2.76s
Episode 116 | Total Reward: -860.79 | Avg(10): -727.86 | Epsilon: 0.559 | Time: 2.53s
Episode 117 | Total Reward: -601.94 | Avg(10): -710.97 | Epsilon: 0.556 | Time: 2.96s
Episode 118 | Total Reward: -890.87 | Avg(10): -707.67 | Epsilon: 0.554 | Time: 2.89s
Episode 119 | Total Reward: -621.60 | Avg(10): -673.93 | Epsilon: 0.551 | Time: 2.65s
Episode 120 | Total Reward: -1100.46 | Avg(10): -707.60 | Epsilon: 0.548 | Time: 4.24s
Episode 121 | Total Reward: -618.61 | Avg(10): -718.44 | Epsilon: 0.545 | Time: 2.65s
Episode 122 | Total Reward: -892.07 | Avg(10): -746.64 | Epsilon: 0.543 | Time: 2.72s
Episode 123 | Total Reward: -620.37 | Avg(10): -733.70 | Epsilon: 0.540 | Time: 2.66s
Episode 124 | Total Reward: -854.63 | Avg(10): -768.08 | Epsilon: 0.537 | Time: 2.68s
Episode 125 | Total Reward: -510.65 | Avg(10): -757.20 | Epsilon: 0.534 | Time: 3.31s
Episode 126 | Total Reward: -746.17 | Avg(10): -745.74 | Epsilon: 0.532 | Time: 3.02s
Episode 127 | Total Reward: -501.36 | Avg(10): -735.68 | Epsilon: 0.529 | Time: 2.61s
Episode 128 | Total Reward: -765.32 | Avg(10): -723.12 | Epsilon: 0.526 | Time: 3.18s
Episode 129 | Total Reward: -854.16 | Avg(10): -746.38 | Epsilon: 0.524 | Time: 3.17s
Episode 130 | Total Reward: -618.82 | Avg(10): -698.21 | Epsilon: 0.521 | Time: 3.09s
Episode 131 | Total Reward: -630.08 | Avg(10): -699.36 | Epsilon: 0.519 | Time: 3.24s
Episode 132 | Total Reward: -514.31 | Avg(10): -661.59 | Epsilon: 0.516 | Time: 2.86s
Episode 133 | Total Reward: -640.25 | Avg(10): -663.57 | Epsilon: 0.513 | Time: 3.46s
Episode 134 | Total Reward: -848.56 | Avg(10): -662.97 | Epsilon: 0.511 | Time: 2.71s
Episode 135 | Total Reward: -844.91 | Avg(10): -696.39 | Epsilon: 0.508 | Time: 2.52s
Episode 136 | Total Reward: -637.58 | Avg(10): -685.53 | Epsilon: 0.506 | Time: 2.29s
Episode 137 | Total Reward: -638.44 | Avg(10): -699.24 | Epsilon: 0.503 | Time: 2.71s
Episode 138 | Total Reward: -642.61 | Avg(10): -686.97 | Epsilon: 0.501 | Time: 3.97s
Episode 139 | Total Reward: -367.66 | Avg(10): -638.32 | Epsilon: 0.498 | Time: 2.91s
Episode 140 | Total Reward: -509.96 | Avg(10): -627.44 | Epsilon: 0.496 | Time: 4.63s
Episode 141 | Total Reward: -377.79 | Avg(10): -602.21 | Epsilon: 0.493 | Time: 2.97s
Episode 142 | Total Reward: -375.39 | Avg(10): -588.31 | Epsilon: 0.491 | Time: 2.86s
Episode 143 | Total Reward: -543.03 | Avg(10): -578.59 | Epsilon: 0.488 | Time: 2.28s
Episode 144 | Total Reward: -375.61 | Avg(10): -531.30 | Epsilon: 0.486 | Time: 2.82s
Episode 145 | Total Reward: -508.89 | Avg(10): -497.70 | Epsilon: 0.483 | Time: 3.19s
Episode 146 | Total Reward: -968.29 | Avg(10): -530.77 | Epsilon: 0.481 | Time: 3.57s
Episode 147 | Total Reward: -253.45 | Avg(10): -492.27 | Epsilon: 0.479 | Time: 3.86s
Episode 148 | Total Reward: -651.21 | Avg(10): -493.13 | Epsilon: 0.476 | Time: 2.87s
Episode 149 | Total Reward: -727.47 | Avg(10): -529.11 | Epsilon: 0.474 | Time: 2.97s
Episode 150 | Total Reward: -503.48 | Avg(10): -528.46 | Epsilon: 0.471 | Time: 2.75s
Episode 151 | Total Reward: -514.68 | Avg(10): -542.15 | Epsilon: 0.469 | Time: 2.39s
Episode 152 | Total Reward: -491.94 | Avg(10): -553.80 | Epsilon: 0.467 | Time: 5.08s
Episode 153 | Total Reward: -846.06 | Avg(10): -584.11 | Epsilon: 0.464 | Time: 5.50s
Episode 154 | Total Reward: -492.14 | Avg(10): -595.76 | Epsilon: 0.462 | Time: 5.42s
Episode 155 | Total Reward: -515.11 | Avg(10): -596.38 | Epsilon: 0.460 | Time: 6.47s
Episode 156 | Total Reward: -752.29 | Avg(10): -574.78 | Epsilon: 0.458 | Time: 5.53s
Episode 157 | Total Reward: -507.25 | Avg(10): -600.16 | Epsilon: 0.455 | Time: 5.72s
Episode 158 | Total Reward: -270.27 | Avg(10): -562.07 | Epsilon: 0.453 | Time: 5.75s
Episode 159 | Total Reward: -600.36 | Avg(10): -549.36 | Epsilon: 0.451 | Time: 5.92s
Episode 160 | Total Reward: -128.06 | Avg(10): -511.82 | Epsilon: 0.448 | Time: 8.09s
Episode 161 | Total Reward: -768.45 | Avg(10): -537.19 | Epsilon: 0.446 | Time: 6.11s
Episode 162 | Total Reward: -489.31 | Avg(10): -536.93 | Epsilon: 0.444 | Time: 3.21s
Episode 163 | Total Reward: -513.47 | Avg(10): -503.67 | Epsilon: 0.442 | Time: 2.48s
Episode 164 | Total Reward: -500.01 | Avg(10): -504.46 | Epsilon: 0.440 | Time: 2.36s
Episode 165 | Total Reward: -678.65 | Avg(10): -520.81 | Epsilon: 0.437 | Time: 2.38s
Episode 166 | Total Reward: -477.74 | Avg(10): -493.36 | Epsilon: 0.435 | Time: 2.56s
Episode 167 | Total Reward: -378.79 | Avg(10): -480.51 | Epsilon: 0.433 | Time: 2.65s
Episode 168 | Total Reward: -621.96 | Avg(10): -515.68 | Epsilon: 0.431 | Time: 2.47s
Episode 169 | Total Reward: -487.24 | Avg(10): -504.37 | Epsilon: 0.429 | Time: 2.82s
Episode 170 | Total Reward: -637.15 | Avg(10): -555.28 | Epsilon: 0.427 | Time: 2.42s
Episode 171 | Total Reward: -251.69 | Avg(10): -503.60 | Epsilon: 0.424 | Time: 2.28s
Episode 172 | Total Reward: -252.33 | Avg(10): -479.90 | Epsilon: 0.422 | Time: 2.71s
Episode 173 | Total Reward: -378.31 | Avg(10): -466.39 | Epsilon: 0.420 | Time: 2.34s
Episode 174 | Total Reward: -372.77 | Avg(10): -453.66 | Epsilon: 0.418 | Time: 2.83s
Episode 175 | Total Reward: -374.59 | Avg(10): -423.26 | Epsilon: 0.416 | Time: 2.62s
Episode 176 | Total Reward: -753.63 | Avg(10): -450.85 | Epsilon: 0.414 | Time: 3.88s
Episode 177 | Total Reward: -868.84 | Avg(10): -499.85 | Epsilon: 0.412 | Time: 3.19s
Episode 178 | Total Reward: -762.50 | Avg(10): -513.90 | Epsilon: 0.410 | Time: 2.51s
Episode 179 | Total Reward: -826.12 | Avg(10): -547.79 | Epsilon: 0.408 | Time: 2.57s
Episode 180 | Total Reward: -938.18 | Avg(10): -577.90 | Epsilon: 0.406 | Time: 4.64s
Episode 181 | Total Reward: -628.61 | Avg(10): -615.59 | Epsilon: 0.404 | Time: 2.67s
Episode 182 | Total Reward: -125.32 | Avg(10): -602.89 | Epsilon: 0.402 | Time: 2.62s
Episode 183 | Total Reward: -252.21 | Avg(10): -590.28 | Epsilon: 0.400 | Time: 3.23s
Episode 184 | Total Reward: -251.88 | Avg(10): -578.19 | Epsilon: 0.398 | Time: 3.29s
Episode 185 | Total Reward: -363.73 | Avg(10): -577.10 | Epsilon: 0.396 | Time: 2.71s
Episode 186 | Total Reward: -471.17 | Avg(10): -548.86 | Epsilon: 0.394 | Time: 2.52s
Episode 187 | Total Reward: -712.37 | Avg(10): -533.21 | Epsilon: 0.392 | Time: 2.52s
Episode 188 | Total Reward: -628.78 | Avg(10): -519.84 | Epsilon: 0.390 | Time: 2.38s
Episode 189 | Total Reward: -245.55 | Avg(10): -461.78 | Epsilon: 0.388 | Time: 2.38s
Episode 190 | Total Reward: -128.01 | Avg(10): -380.76 | Epsilon: 0.386 | Time: 2.37s
Episode 191 | Total Reward: -609.19 | Avg(10): -378.82 | Epsilon: 0.384 | Time: 2.42s
Episode 192 | Total Reward: -453.44 | Avg(10): -411.63 | Epsilon: 0.382 | Time: 2.50s
Episode 193 | Total Reward: -253.76 | Avg(10): -411.79 | Epsilon: 0.380 | Time: 2.48s
Episode 194 | Total Reward: -282.79 | Avg(10): -414.88 | Epsilon: 0.378 | Time: 2.50s
Episode 195 | Total Reward: -769.50 | Avg(10): -455.45 | Epsilon: 0.376 | Time: 2.46s
Episode 196 | Total Reward: -376.60 | Avg(10): -446.00 | Epsilon: 0.374 | Time: 2.41s
Episode 197 | Total Reward: -509.90 | Avg(10): -425.75 | Epsilon: 0.373 | Time: 2.38s
Episode 198 | Total Reward: -127.08 | Avg(10): -375.58 | Epsilon: 0.371 | Time: 2.38s
Episode 199 | Total Reward: -126.77 | Avg(10): -363.70 | Epsilon: 0.369 | Time: 2.37s
Episode 200 | Total Reward: -402.20 | Avg(10): -391.12 | Epsilon: 0.367 | Time: 3.90s
Best average reward over 10 episodes: -363.70
Best model weights saved to: dqn_pendulum_21actions_weights.h5
Total training time: 568.87s
Test Episode 1: Total Reward = -607.68
Test Episode 2: Total Reward = -494.75
Test Episode 3: Total Reward = -628.61
Test Episode 4: Total Reward = -367.17
Test Episode 5: Total Reward = -361.16
Test Episode 6: Total Reward = -273.38
Test Episode 7: Total Reward = -862.95
Test Episode 8: Total Reward = -377.29
Test Episode 9: Total Reward = -363.28
Test Episode 10: Total Reward = -372.20

Average Reward over 10 episodes: -470.85 ± 170.22
Saved best episode GIF to dqn_pendulum_21actions_eval_best.gif
Saved worst episode GIF to dqn_pendulum_21actions_eval_worst.gif
Saved average episode GIF to dqn_pendulum_21actions_eval_average.gif
============================================================
============================================================
Running experiment with N_ACTIONS = 50

Model Summary:
Model: "dqn_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_18 (Dense)            multiple                  256       
                                                                 
 dense_19 (Dense)            multiple                  4160      
                                                                 
 dense_20 (Dense)            multiple                  3250      
                                                                 
=================================================================
Total params: 7666 (29.95 KB)
Trainable params: 7666 (29.95 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Episode 1 | Total Reward: -1070.38 | Avg(10): -1070.38 | Epsilon: 0.995 | Time: 0.07s
Episode 2 | Total Reward: -1056.86 | Avg(10): -1063.62 | Epsilon: 0.990 | Time: 0.02s
Episode 3 | Total Reward: -862.34 | Avg(10): -996.53 | Epsilon: 0.985 | Time: 0.02s
Episode 4 | Total Reward: -1531.54 | Avg(10): -1130.28 | Epsilon: 0.980 | Time: 0.02s
Episode 5 | Total Reward: -1380.07 | Avg(10): -1180.24 | Epsilon: 0.975 | Time: 0.08s
Episode 6 | Total Reward: -1158.77 | Avg(10): -1176.66 | Epsilon: 0.970 | Time: 5.75s
Episode 7 | Total Reward: -1171.95 | Avg(10): -1175.99 | Epsilon: 0.966 | Time: 6.11s
Episode 8 | Total Reward: -900.13 | Avg(10): -1141.50 | Epsilon: 0.961 | Time: 5.74s
Episode 9 | Total Reward: -1711.45 | Avg(10): -1204.83 | Epsilon: 0.956 | Time: 6.28s
Episode 10 | Total Reward: -1748.40 | Avg(10): -1259.19 | Epsilon: 0.951 | Time: 10.49s
Episode 11 | Total Reward: -1800.79 | Avg(10): -1332.23 | Epsilon: 0.946 | Time: 8.29s
Episode 12 | Total Reward: -1841.74 | Avg(10): -1410.72 | Epsilon: 0.942 | Time: 8.99s
Episode 13 | Total Reward: -1712.59 | Avg(10): -1495.74 | Epsilon: 0.937 | Time: 17.05s
Episode 14 | Total Reward: -1467.34 | Avg(10): -1489.32 | Epsilon: 0.932 | Time: 10.89s
Episode 15 | Total Reward: -969.06 | Avg(10): -1448.22 | Epsilon: 0.928 | Time: 9.70s
Episode 16 | Total Reward: -1410.54 | Avg(10): -1473.40 | Epsilon: 0.923 | Time: 9.37s
Episode 17 | Total Reward: -1662.15 | Avg(10): -1522.42 | Epsilon: 0.918 | Time: 11.13s
Episode 18 | Total Reward: -990.44 | Avg(10): -1531.45 | Epsilon: 0.914 | Time: 5.14s
Episode 19 | Total Reward: -1712.99 | Avg(10): -1531.60 | Epsilon: 0.909 | Time: 5.44s
Episode 20 | Total Reward: -1441.18 | Avg(10): -1500.88 | Epsilon: 0.905 | Time: 6.36s
Episode 21 | Total Reward: -1393.54 | Avg(10): -1460.16 | Epsilon: 0.900 | Time: 2.69s
Episode 22 | Total Reward: -1134.92 | Avg(10): -1389.47 | Epsilon: 0.896 | Time: 2.78s
Episode 23 | Total Reward: -1405.91 | Avg(10): -1358.81 | Epsilon: 0.891 | Time: 3.07s
Episode 24 | Total Reward: -1234.93 | Avg(10): -1335.57 | Epsilon: 0.887 | Time: 2.83s
Episode 25 | Total Reward: -858.66 | Avg(10): -1324.53 | Epsilon: 0.882 | Time: 3.17s
Episode 26 | Total Reward: -1052.34 | Avg(10): -1288.71 | Epsilon: 0.878 | Time: 3.02s
Episode 27 | Total Reward: -1532.49 | Avg(10): -1275.74 | Epsilon: 0.873 | Time: 2.74s
Episode 28 | Total Reward: -1528.40 | Avg(10): -1329.54 | Epsilon: 0.869 | Time: 2.80s
Episode 29 | Total Reward: -1433.38 | Avg(10): -1301.58 | Epsilon: 0.865 | Time: 2.94s
Episode 30 | Total Reward: -844.79 | Avg(10): -1241.94 | Epsilon: 0.860 | Time: 2.94s
Episode 31 | Total Reward: -1296.03 | Avg(10): -1232.19 | Epsilon: 0.856 | Time: 2.91s
Episode 32 | Total Reward: -1153.18 | Avg(10): -1234.01 | Epsilon: 0.852 | Time: 2.84s
Episode 33 | Total Reward: -1563.43 | Avg(10): -1249.76 | Epsilon: 0.848 | Time: 2.80s
Episode 34 | Total Reward: -982.16 | Avg(10): -1224.49 | Epsilon: 0.843 | Time: 3.04s
Episode 35 | Total Reward: -1073.16 | Avg(10): -1245.94 | Epsilon: 0.839 | Time: 2.79s
Episode 36 | Total Reward: -1309.84 | Avg(10): -1271.68 | Epsilon: 0.835 | Time: 2.99s
Episode 37 | Total Reward: -898.53 | Avg(10): -1208.29 | Epsilon: 0.831 | Time: 2.94s
Episode 38 | Total Reward: -894.91 | Avg(10): -1144.94 | Epsilon: 0.827 | Time: 2.85s
Episode 39 | Total Reward: -1003.44 | Avg(10): -1101.95 | Epsilon: 0.822 | Time: 2.72s
Episode 40 | Total Reward: -1421.28 | Avg(10): -1159.60 | Epsilon: 0.818 | Time: 4.70s
Episode 41 | Total Reward: -880.11 | Avg(10): -1118.00 | Epsilon: 0.814 | Time: 2.95s
Episode 42 | Total Reward: -878.42 | Avg(10): -1090.53 | Epsilon: 0.810 | Time: 3.07s
Episode 43 | Total Reward: -1043.52 | Avg(10): -1038.54 | Epsilon: 0.806 | Time: 3.01s
Episode 44 | Total Reward: -1501.24 | Avg(10): -1090.45 | Epsilon: 0.802 | Time: 2.80s
Episode 45 | Total Reward: -962.77 | Avg(10): -1079.41 | Epsilon: 0.798 | Time: 2.90s
Episode 46 | Total Reward: -1544.61 | Avg(10): -1102.88 | Epsilon: 0.794 | Time: 2.77s
Episode 47 | Total Reward: -1536.10 | Avg(10): -1166.64 | Epsilon: 0.790 | Time: 3.01s
Episode 48 | Total Reward: -909.07 | Avg(10): -1168.06 | Epsilon: 0.786 | Time: 2.71s
Episode 49 | Total Reward: -864.49 | Avg(10): -1154.16 | Epsilon: 0.782 | Time: 2.82s
Episode 50 | Total Reward: -1011.79 | Avg(10): -1113.21 | Epsilon: 0.778 | Time: 2.78s
Episode 51 | Total Reward: -1531.23 | Avg(10): -1178.32 | Epsilon: 0.774 | Time: 2.91s
Episode 52 | Total Reward: -1510.48 | Avg(10): -1241.53 | Epsilon: 0.771 | Time: 2.96s
Episode 53 | Total Reward: -890.17 | Avg(10): -1226.19 | Epsilon: 0.767 | Time: 3.06s
Episode 54 | Total Reward: -1439.76 | Avg(10): -1220.05 | Epsilon: 0.763 | Time: 2.91s
Episode 55 | Total Reward: -1207.64 | Avg(10): -1244.53 | Epsilon: 0.759 | Time: 2.89s
Episode 56 | Total Reward: -1189.49 | Avg(10): -1209.02 | Epsilon: 0.755 | Time: 3.87s
Episode 57 | Total Reward: -1730.26 | Avg(10): -1228.44 | Epsilon: 0.751 | Time: 3.67s
Episode 58 | Total Reward: -1592.24 | Avg(10): -1296.75 | Epsilon: 0.748 | Time: 2.77s
Episode 59 | Total Reward: -1489.34 | Avg(10): -1359.24 | Epsilon: 0.744 | Time: 2.75s
Episode 60 | Total Reward: -1023.38 | Avg(10): -1360.40 | Epsilon: 0.740 | Time: 4.89s
Episode 61 | Total Reward: -1501.58 | Avg(10): -1357.43 | Epsilon: 0.737 | Time: 4.94s
Episode 62 | Total Reward: -1030.23 | Avg(10): -1309.41 | Epsilon: 0.733 | Time: 3.90s
Episode 63 | Total Reward: -991.25 | Avg(10): -1319.52 | Epsilon: 0.729 | Time: 3.70s
Episode 64 | Total Reward: -854.16 | Avg(10): -1260.96 | Epsilon: 0.726 | Time: 2.91s
Episode 65 | Total Reward: -1055.36 | Avg(10): -1245.73 | Epsilon: 0.722 | Time: 3.03s
Episode 66 | Total Reward: -888.92 | Avg(10): -1215.67 | Epsilon: 0.718 | Time: 3.70s
Episode 67 | Total Reward: -931.50 | Avg(10): -1135.80 | Epsilon: 0.715 | Time: 3.21s
Episode 68 | Total Reward: -986.03 | Avg(10): -1075.18 | Epsilon: 0.711 | Time: 2.83s
Episode 69 | Total Reward: -954.66 | Avg(10): -1021.71 | Epsilon: 0.708 | Time: 3.20s
Episode 70 | Total Reward: -1165.16 | Avg(10): -1035.89 | Epsilon: 0.704 | Time: 3.52s
Episode 71 | Total Reward: -1041.58 | Avg(10): -989.89 | Epsilon: 0.701 | Time: 3.74s
Episode 72 | Total Reward: -1047.19 | Avg(10): -991.58 | Epsilon: 0.697 | Time: 3.39s
Episode 73 | Total Reward: -1060.61 | Avg(10): -998.52 | Epsilon: 0.694 | Time: 6.30s
Episode 74 | Total Reward: -1050.18 | Avg(10): -1018.12 | Epsilon: 0.690 | Time: 5.88s
Episode 75 | Total Reward: -985.77 | Avg(10): -1011.16 | Epsilon: 0.687 | Time: 8.51s
Episode 76 | Total Reward: -1070.45 | Avg(10): -1029.31 | Epsilon: 0.683 | Time: 5.42s
Episode 77 | Total Reward: -1126.28 | Avg(10): -1048.79 | Epsilon: 0.680 | Time: 5.20s
Episode 78 | Total Reward: -1443.35 | Avg(10): -1094.52 | Epsilon: 0.676 | Time: 5.45s
Episode 79 | Total Reward: -1121.53 | Avg(10): -1111.21 | Epsilon: 0.673 | Time: 5.38s
Episode 80 | Total Reward: -1277.92 | Avg(10): -1122.49 | Epsilon: 0.670 | Time: 8.99s
Episode 81 | Total Reward: -1064.31 | Avg(10): -1124.76 | Epsilon: 0.666 | Time: 9.31s
Episode 82 | Total Reward: -1234.47 | Avg(10): -1143.49 | Epsilon: 0.663 | Time: 3.22s
Episode 83 | Total Reward: -1225.37 | Avg(10): -1159.96 | Epsilon: 0.660 | Time: 2.86s
Episode 84 | Total Reward: -1307.60 | Avg(10): -1185.71 | Epsilon: 0.656 | Time: 3.07s
Episode 85 | Total Reward: -1148.60 | Avg(10): -1201.99 | Epsilon: 0.653 | Time: 2.92s
Episode 86 | Total Reward: -1148.15 | Avg(10): -1209.76 | Epsilon: 0.650 | Time: 2.92s
Episode 87 | Total Reward: -929.50 | Avg(10): -1190.08 | Epsilon: 0.647 | Time: 2.90s
Episode 88 | Total Reward: -1137.96 | Avg(10): -1159.54 | Epsilon: 0.643 | Time: 2.85s
Episode 89 | Total Reward: -1143.35 | Avg(10): -1161.72 | Epsilon: 0.640 | Time: 2.83s
Episode 90 | Total Reward: -1012.55 | Avg(10): -1135.19 | Epsilon: 0.637 | Time: 2.85s
Episode 91 | Total Reward: -1183.57 | Avg(10): -1147.11 | Epsilon: 0.634 | Time: 2.84s
Episode 92 | Total Reward: -1050.60 | Avg(10): -1128.73 | Epsilon: 0.631 | Time: 2.92s
Episode 93 | Total Reward: -1059.57 | Avg(10): -1112.15 | Epsilon: 0.627 | Time: 3.03s
Episode 94 | Total Reward: -1207.68 | Avg(10): -1102.15 | Epsilon: 0.624 | Time: 2.91s
Episode 95 | Total Reward: -1152.94 | Avg(10): -1102.59 | Epsilon: 0.621 | Time: 2.87s
Episode 96 | Total Reward: -1410.41 | Avg(10): -1128.81 | Epsilon: 0.618 | Time: 3.06s
Episode 97 | Total Reward: -1408.50 | Avg(10): -1176.71 | Epsilon: 0.615 | Time: 3.15s
Episode 98 | Total Reward: -1057.96 | Avg(10): -1168.71 | Epsilon: 0.612 | Time: 3.11s
Episode 99 | Total Reward: -604.67 | Avg(10): -1114.85 | Epsilon: 0.609 | Time: 2.84s
Episode 100 | Total Reward: -1233.36 | Avg(10): -1136.93 | Epsilon: 0.606 | Time: 4.54s
Episode 101 | Total Reward: -1184.24 | Avg(10): -1136.99 | Epsilon: 0.603 | Time: 2.85s
Episode 102 | Total Reward: -1134.44 | Avg(10): -1145.38 | Epsilon: 0.600 | Time: 3.07s
Episode 103 | Total Reward: -1047.82 | Avg(10): -1144.20 | Epsilon: 0.597 | Time: 2.82s
Episode 104 | Total Reward: -1177.45 | Avg(10): -1141.18 | Epsilon: 0.594 | Time: 3.04s
Episode 105 | Total Reward: -1098.88 | Avg(10): -1135.77 | Epsilon: 0.591 | Time: 2.84s
Episode 106 | Total Reward: -1076.60 | Avg(10): -1102.39 | Epsilon: 0.588 | Time: 3.05s
Episode 107 | Total Reward: -1076.80 | Avg(10): -1069.22 | Epsilon: 0.585 | Time: 2.83s
Episode 108 | Total Reward: -1204.55 | Avg(10): -1083.88 | Epsilon: 0.582 | Time: 2.91s
Episode 109 | Total Reward: -571.98 | Avg(10): -1080.61 | Epsilon: 0.579 | Time: 2.96s
Episode 110 | Total Reward: -1231.98 | Avg(10): -1080.47 | Epsilon: 0.576 | Time: 2.87s
Episode 111 | Total Reward: -1231.46 | Avg(10): -1085.20 | Epsilon: 0.573 | Time: 2.88s
Episode 112 | Total Reward: -1164.61 | Avg(10): -1088.21 | Epsilon: 0.570 | Time: 2.98s
Episode 113 | Total Reward: -1034.03 | Avg(10): -1086.83 | Epsilon: 0.568 | Time: 2.90s
Episode 114 | Total Reward: -879.05 | Avg(10): -1056.99 | Epsilon: 0.565 | Time: 5.02s
Episode 115 | Total Reward: -1084.82 | Avg(10): -1055.59 | Epsilon: 0.562 | Time: 5.60s
Episode 116 | Total Reward: -991.92 | Avg(10): -1047.12 | Epsilon: 0.559 | Time: 6.93s
Episode 117 | Total Reward: -1153.94 | Avg(10): -1054.83 | Epsilon: 0.556 | Time: 5.49s
Episode 118 | Total Reward: -653.99 | Avg(10): -999.78 | Epsilon: 0.554 | Time: 5.64s
Episode 119 | Total Reward: -1179.62 | Avg(10): -1060.54 | Epsilon: 0.551 | Time: 6.43s
Episode 120 | Total Reward: -1138.52 | Avg(10): -1051.20 | Epsilon: 0.548 | Time: 7.24s
Episode 121 | Total Reward: -1000.15 | Avg(10): -1028.07 | Epsilon: 0.545 | Time: 4.25s
Episode 122 | Total Reward: -1284.18 | Avg(10): -1040.02 | Epsilon: 0.543 | Time: 3.75s
Episode 123 | Total Reward: -922.13 | Avg(10): -1028.83 | Epsilon: 0.540 | Time: 6.55s
Episode 124 | Total Reward: -950.58 | Avg(10): -1035.99 | Epsilon: 0.537 | Time: 4.95s
Episode 125 | Total Reward: -1066.21 | Avg(10): -1034.12 | Epsilon: 0.534 | Time: 6.35s
Episode 126 | Total Reward: -888.86 | Avg(10): -1023.82 | Epsilon: 0.532 | Time: 4.29s
Episode 127 | Total Reward: -1144.35 | Avg(10): -1022.86 | Epsilon: 0.529 | Time: 6.52s
Episode 128 | Total Reward: -638.19 | Avg(10): -1021.28 | Epsilon: 0.526 | Time: 3.66s
Episode 129 | Total Reward: -655.37 | Avg(10): -968.85 | Epsilon: 0.524 | Time: 4.76s
Episode 130 | Total Reward: -914.78 | Avg(10): -946.48 | Epsilon: 0.521 | Time: 4.94s
Episode 131 | Total Reward: -642.02 | Avg(10): -910.67 | Epsilon: 0.519 | Time: 5.55s
Episode 132 | Total Reward: -1078.36 | Avg(10): -890.08 | Epsilon: 0.516 | Time: 5.35s
Episode 133 | Total Reward: -1222.38 | Avg(10): -920.11 | Epsilon: 0.513 | Time: 6.49s
Episode 134 | Total Reward: -1062.90 | Avg(10): -931.34 | Epsilon: 0.511 | Time: 5.98s
Episode 135 | Total Reward: -748.01 | Avg(10): -899.52 | Epsilon: 0.508 | Time: 5.70s
Episode 136 | Total Reward: -1022.53 | Avg(10): -912.89 | Epsilon: 0.506 | Time: 5.59s
Episode 137 | Total Reward: -1064.57 | Avg(10): -904.91 | Epsilon: 0.503 | Time: 5.86s
Episode 138 | Total Reward: -771.08 | Avg(10): -918.20 | Epsilon: 0.501 | Time: 6.35s
Episode 139 | Total Reward: -1027.15 | Avg(10): -955.38 | Epsilon: 0.498 | Time: 4.84s
Episode 140 | Total Reward: -976.24 | Avg(10): -961.52 | Epsilon: 0.496 | Time: 4.68s
Episode 141 | Total Reward: -888.35 | Avg(10): -986.16 | Epsilon: 0.493 | Time: 2.83s
Episode 142 | Total Reward: -643.12 | Avg(10): -942.63 | Epsilon: 0.491 | Time: 2.64s
Episode 143 | Total Reward: -630.44 | Avg(10): -883.44 | Epsilon: 0.488 | Time: 3.03s
Episode 144 | Total Reward: -875.56 | Avg(10): -864.71 | Epsilon: 0.486 | Time: 2.92s
Episode 145 | Total Reward: -780.17 | Avg(10): -867.92 | Epsilon: 0.483 | Time: 2.63s
Episode 146 | Total Reward: -800.45 | Avg(10): -845.71 | Epsilon: 0.481 | Time: 2.71s
Episode 147 | Total Reward: -924.68 | Avg(10): -831.72 | Epsilon: 0.479 | Time: 3.20s
Episode 148 | Total Reward: -918.27 | Avg(10): -846.44 | Epsilon: 0.476 | Time: 2.99s
Episode 149 | Total Reward: -887.38 | Avg(10): -832.47 | Epsilon: 0.474 | Time: 2.56s
Episode 150 | Total Reward: -1149.39 | Avg(10): -849.78 | Epsilon: 0.471 | Time: 2.83s
Episode 151 | Total Reward: -762.67 | Avg(10): -837.21 | Epsilon: 0.469 | Time: 3.07s
Episode 152 | Total Reward: -1019.12 | Avg(10): -874.81 | Epsilon: 0.467 | Time: 3.04s
Episode 153 | Total Reward: -999.58 | Avg(10): -911.72 | Epsilon: 0.464 | Time: 2.93s
Episode 154 | Total Reward: -759.46 | Avg(10): -900.11 | Epsilon: 0.462 | Time: 3.04s
Episode 155 | Total Reward: -751.54 | Avg(10): -897.25 | Epsilon: 0.460 | Time: 2.50s
Episode 156 | Total Reward: -765.53 | Avg(10): -893.76 | Epsilon: 0.458 | Time: 2.43s
Episode 157 | Total Reward: -1010.61 | Avg(10): -902.35 | Epsilon: 0.455 | Time: 2.84s
Episode 158 | Total Reward: -778.14 | Avg(10): -888.34 | Epsilon: 0.453 | Time: 2.92s
Episode 159 | Total Reward: -863.19 | Avg(10): -885.92 | Epsilon: 0.451 | Time: 2.53s
Episode 160 | Total Reward: -766.32 | Avg(10): -847.61 | Epsilon: 0.448 | Time: 5.08s
Episode 161 | Total Reward: -1109.48 | Avg(10): -882.30 | Epsilon: 0.446 | Time: 3.00s
Episode 162 | Total Reward: -871.30 | Avg(10): -867.51 | Epsilon: 0.444 | Time: 2.63s
Episode 163 | Total Reward: -1131.09 | Avg(10): -880.67 | Epsilon: 0.442 | Time: 2.28s
Episode 164 | Total Reward: -867.87 | Avg(10): -891.51 | Epsilon: 0.440 | Time: 2.67s
Episode 165 | Total Reward: -728.54 | Avg(10): -889.21 | Epsilon: 0.437 | Time: 2.38s
Episode 166 | Total Reward: -1120.00 | Avg(10): -924.65 | Epsilon: 0.435 | Time: 2.44s
Episode 167 | Total Reward: -1018.77 | Avg(10): -925.47 | Epsilon: 0.433 | Time: 2.78s
Episode 168 | Total Reward: -898.97 | Avg(10): -937.55 | Epsilon: 0.431 | Time: 2.71s
Episode 169 | Total Reward: -863.45 | Avg(10): -937.58 | Epsilon: 0.429 | Time: 3.14s
Episode 170 | Total Reward: -878.86 | Avg(10): -948.83 | Epsilon: 0.427 | Time: 4.03s
Episode 171 | Total Reward: -777.84 | Avg(10): -915.67 | Epsilon: 0.424 | Time: 3.24s
Episode 172 | Total Reward: -748.76 | Avg(10): -903.41 | Epsilon: 0.422 | Time: 3.31s
Episode 173 | Total Reward: -861.72 | Avg(10): -876.48 | Epsilon: 0.420 | Time: 3.14s
Episode 174 | Total Reward: -865.03 | Avg(10): -876.19 | Epsilon: 0.418 | Time: 3.15s
Episode 175 | Total Reward: -749.17 | Avg(10): -878.26 | Epsilon: 0.416 | Time: 2.89s
Episode 176 | Total Reward: -756.34 | Avg(10): -841.89 | Epsilon: 0.414 | Time: 2.51s
Episode 177 | Total Reward: -762.85 | Avg(10): -816.30 | Epsilon: 0.412 | Time: 3.08s
Episode 178 | Total Reward: -630.31 | Avg(10): -789.43 | Epsilon: 0.410 | Time: 3.29s
Episode 179 | Total Reward: -878.51 | Avg(10): -790.94 | Epsilon: 0.408 | Time: 3.14s
Episode 180 | Total Reward: -942.05 | Avg(10): -797.26 | Epsilon: 0.406 | Time: 4.56s
Episode 181 | Total Reward: -758.45 | Avg(10): -795.32 | Epsilon: 0.404 | Time: 3.20s
Episode 182 | Total Reward: -488.50 | Avg(10): -769.29 | Epsilon: 0.402 | Time: 3.09s
Episode 183 | Total Reward: -752.51 | Avg(10): -758.37 | Epsilon: 0.400 | Time: 3.71s
Episode 184 | Total Reward: -554.12 | Avg(10): -727.28 | Epsilon: 0.398 | Time: 3.93s
Episode 185 | Total Reward: -849.88 | Avg(10): -737.35 | Epsilon: 0.396 | Time: 5.12s
Episode 186 | Total Reward: -526.78 | Avg(10): -714.40 | Epsilon: 0.394 | Time: 3.48s
Episode 187 | Total Reward: -754.84 | Avg(10): -713.60 | Epsilon: 0.392 | Time: 3.52s
Episode 188 | Total Reward: -740.19 | Avg(10): -724.58 | Epsilon: 0.390 | Time: 3.65s
Episode 189 | Total Reward: -820.87 | Avg(10): -718.82 | Epsilon: 0.388 | Time: 3.34s
Episode 190 | Total Reward: -932.57 | Avg(10): -717.87 | Epsilon: 0.386 | Time: 3.21s
Episode 191 | Total Reward: -519.20 | Avg(10): -693.95 | Epsilon: 0.384 | Time: 3.43s
Episode 192 | Total Reward: -509.92 | Avg(10): -696.09 | Epsilon: 0.382 | Time: 3.50s
Episode 193 | Total Reward: -499.38 | Avg(10): -670.78 | Epsilon: 0.380 | Time: 4.02s
Episode 194 | Total Reward: -843.72 | Avg(10): -699.74 | Epsilon: 0.378 | Time: 3.69s
Episode 195 | Total Reward: -637.12 | Avg(10): -678.46 | Epsilon: 0.376 | Time: 3.40s
Episode 196 | Total Reward: -856.52 | Avg(10): -711.43 | Epsilon: 0.374 | Time: 4.77s
Episode 197 | Total Reward: -750.87 | Avg(10): -711.04 | Epsilon: 0.373 | Time: 5.78s
Episode 198 | Total Reward: -628.16 | Avg(10): -699.83 | Epsilon: 0.371 | Time: 5.25s
Episode 199 | Total Reward: -509.76 | Avg(10): -668.72 | Epsilon: 0.369 | Time: 5.44s
Episode 200 | Total Reward: -253.60 | Avg(10): -600.82 | Epsilon: 0.367 | Time: 8.44s
Best average reward over 10 episodes: -600.82
Best model weights saved to: dqn_pendulum_50actions_weights.h5
Total training time: 798.84s
Test Episode 1: Total Reward = -367.99
Test Episode 2: Total Reward = -384.47
Test Episode 3: Total Reward = -499.06
Test Episode 4: Total Reward = -357.61
Test Episode 5: Total Reward = -549.81
Test Episode 6: Total Reward = -381.41
Test Episode 7: Total Reward = -651.54
Test Episode 8: Total Reward = -865.41
Test Episode 9: Total Reward = -129.38
Test Episode 10: Total Reward = -245.49

Average Reward over 10 episodes: -443.22 ± 198.49
Saved best episode GIF to dqn_pendulum_50actions_eval_best.gif
Saved worst episode GIF to dqn_pendulum_50actions_eval_worst.gif
Saved average episode GIF to dqn_pendulum_50actions_eval_average.gif
============================================================
In [6]:
def create_individual_comparison_plots():
    n_actions_list = [5, 11, 21, 50]
    
    # 1. Training Progress Comparison
    fig, axs = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('DQN Training Progress Across Different Action Spaces', fontsize=20, y=0.95)
    
    positions = [(0,0), (0,1), (1,0), (1,1)]
    for idx, n_actions in enumerate(n_actions_list):
        experiment_prefix = f"dqn_pendulum_{n_actions}actions"
        file_path = f"{experiment_prefix}_training_plot.png"
        
        row, col = positions[idx]
        ax = axs[row, col]
        
        if os.path.exists(file_path):
            img = plt.imread(file_path)
            ax.imshow(img)
            ax.set_title(f"N_ACTIONS = {n_actions}", fontsize=16, pad=10)
            ax.axis('off')
    
    plt.tight_layout(rect=[0, 0, 1, 0.93])
    plt.savefig("training_comparison_grid.png", dpi=300, bbox_inches='tight')
    plt.show()

    # 2. Episode Times Comparison
    fig, axs = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Training Time per Episode Across Different Action Spaces', fontsize=20, y=0.95)
    
    for idx, n_actions in enumerate(n_actions_list):
        experiment_prefix = f"dqn_pendulum_{n_actions}actions"
        file_path = f"{experiment_prefix}_episode_times.png"
        
        row, col = positions[idx]
        ax = axs[row, col]
        
        if os.path.exists(file_path):
            img = plt.imread(file_path)
            ax.imshow(img)
            ax.set_title(f"N_ACTIONS = {n_actions}", fontsize=16, pad=10)
            ax.axis('off')
    
    plt.tight_layout(rect=[0, 0, 1, 0.93])
    plt.savefig("episode_times_comparison_grid.png", dpi=300, bbox_inches='tight')
    plt.show()

    # 3. Evaluation Returns Comparison
    fig, axs = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Final Evaluation Performance Distribution', fontsize=20, y=0.95)
    
    for idx, n_actions in enumerate(n_actions_list):
        experiment_prefix = f"dqn_pendulum_{n_actions}actions"
        file_path = f"{experiment_prefix}_eval_returns.png"
        
        row, col = positions[idx]
        ax = axs[row, col]
        
        if os.path.exists(file_path):
            img = plt.imread(file_path)
            ax.imshow(img)
            ax.set_title(f"N_ACTIONS = {n_actions}", fontsize=16, pad=10)
            ax.axis('off')
    
    plt.tight_layout(rect=[0, 0, 1, 0.93])
    plt.savefig("evaluation_comparison_grid.png", dpi=300, bbox_inches='tight')
    plt.show()
In [30]:
create_individual_comparison_plots()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Mistake I made below: I used epsilon = 0.05 for evaluation instead of 0¶

  • I will be doing another set of 10 test episodes with epsilon values set = 0

In [7]:
def evaluate_epsilon_zero(experiment_prefix, n_actions, num_episodes=10):
    """Evaluate saved model with epsilon=0 (pure exploitation)"""
    
    # Same parameters as training
    INPUT_SHAPE = 3
    GAMMA = 0.99
    REPLAY_MEMORY_SIZE = 50000
    MIN_REPLAY_MEMORY = 1000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    MAX_STEPS = 200
    
    # Load the best saved model
    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"
    
    # Recreate agent
    agent = DQNAgent(INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, 
                    MIN_REPLAY_MEMORY, BATCH_SIZE, TARGET_UPDATE_EVERY, 
                    LEARNING_RATE, EPSILON_START, EPSILON_MIN, EPSILON_DECAY)
    
    # Load weights and set epsilon=0
    agent.load(SAVE_WEIGHTS_PATH)
    agent.epsilon = 0.0  # Force pure exploitation
    
    print(f"\nEvaluating {n_actions} actions with epsilon=0.0")
    print(f"Loaded weights: {SAVE_WEIGHTS_PATH}")
    
    env = gym.make('Pendulum-v0')
    rewards = []
    
    for ep in range(num_episodes):
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]
        total_reward = 0
        
        for t in range(MAX_STEPS):
            a_idx = agent.select_action(s)  # Now epsilon=0
            torque = action_index_to_torque(a_idx, n_actions)
            s_next, r, done, info = env.step(torque)
            s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            total_reward += r
            s = s_next
            if done:
                break
        
        rewards.append(total_reward)
        print(f"Episode {ep+1}: {total_reward:.2f}")
    
    env.close()
    return rewards

def compare_results():
    """Compare epsilon=0 results across different action counts"""
    
    # Teacher's baseline results
    teacher_mean = -212.00
    teacher_std = 99.12
    
    print("="*60)
    print("EPSILON=0 EVALUATION COMPARISON")
    print("="*60)
    print(f"Teacher Baseline (40 actions): {teacher_mean:.2f} ± {teacher_std:.2f}")
    print("-" * 40)
    
    action_counts = [5, 11, 21, 50]
    results = {}
    
    for n_actions in action_counts:
        experiment_prefix = f"dqn_pendulum_{n_actions}actions"
        try:
            rewards = evaluate_epsilon_zero(experiment_prefix, n_actions)
            mean_reward = np.mean(rewards)
            std_reward = np.std(rewards)
            
            results[n_actions] = {
                'mean': mean_reward,
                'std': std_reward,
                'rewards': rewards
            }
            
            improvement = mean_reward - teacher_mean
            variance_reduction = teacher_std - std_reward
            
            print(f"{n_actions:2d} actions: {mean_reward:7.2f} ± {std_reward:5.2f} "
                  f"({improvement:+6.2f} vs baseline, variance {variance_reduction:+5.2f})")
            
        except FileNotFoundError:
            print(f"{n_actions:2d} actions: Weights file not found")
    
    # Find best configuration
    if results:
        best_actions = max(results.keys(), key=lambda k: results[k]['mean'])
        best_mean = results[best_actions]['mean']
        best_std = results[best_actions]['std']
        
        print(f"\nBest configuration: {best_actions} actions")
        print(f"Performance: {best_mean:.2f} ± {best_std:.2f}")
        print(f"Improvement over teacher: {best_mean - teacher_mean:+.2f} points")
    
    return results
In [8]:
def compare_results():
    """Compare epsilon=0 results across different action counts"""
    
    # Practical results
    teacher_mean = -212.00
    teacher_std = 99.12
    
    print("="*60)
    print("EPSILON=0 EVALUATION COMPARISON")
    print("="*60)
    print(f"Teacher Baseline (40 actions): {teacher_mean:.2f} ± {teacher_std:.2f}")
    print("-" * 40)
    
    action_counts = [5, 11, 21, 50]
    results = {}
    
    for n_actions in action_counts:
        experiment_prefix = f"dqn_pendulum_{n_actions}actions"
        try:
            rewards = evaluate_epsilon_zero(experiment_prefix, n_actions)
            mean_reward = np.mean(rewards)
            std_reward = np.std(rewards)
            
            results[n_actions] = {
                'mean': mean_reward,
                'std': std_reward,
                'rewards': rewards
            }
            
            improvement = mean_reward - teacher_mean
            variance_reduction = teacher_std - std_reward
            
            print(f"{n_actions:2d} actions: {mean_reward:7.2f} ± {std_reward:5.2f} "
                  f"({improvement:+6.2f} vs baseline, variance {variance_reduction:+5.2f})")
            
        except FileNotFoundError:
            print(f"{n_actions:2d} actions: Weights file not found")
    
    # Find best configuration
    if results:
        best_actions = max(results.keys(), key=lambda k: results[k]['mean'])
        best_mean = results[best_actions]['mean']
        best_std = results[best_actions]['std']
        
        print(f"\nBest configuration: {best_actions} actions")
        print(f"Performance: {best_mean:.2f} ± {best_std:.2f}")
        print(f"Improvement over teacher: {best_mean - teacher_mean:+.2f} points")
    
    return results
In [9]:
def plot_comparison(results):
    """Simple comparison plot"""
    if not results:
        return
    
    actions = sorted(results.keys())
    means = [results[a]['mean'] for a in actions]
    stds = [results[a]['std'] for a in actions]
    
    plt.figure(figsize=(10, 6))
    plt.errorbar(actions, means, yerr=stds, marker='o', capsize=5, linewidth=2)
    plt.axhline(y=-212.00, color='red', linestyle='--', label='Teacher Baseline')
    plt.xlabel('Number of Actions')
    plt.ylabel('Mean Reward (Epsilon=0)')
    plt.title('Action Space Discretization Effect on Performance')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.savefig('epsilon_zero_comparison.png', dpi=150)
    plt.show()
In [34]:
if __name__ == "__main__":
    # Run epsilon=0 evaluation for all saved models
    results = compare_results()
    
    # Create comparison plot
    plot_comparison(results)
    
    print(f"\nComparison plot saved as 'epsilon_zero_comparison.png'")
============================================================
EPSILON=0 EVALUATION COMPARISON
============================================================
Teacher Baseline (40 actions): -212.00 ± 99.12
----------------------------------------

Evaluating 5 actions with epsilon=0.0
Loaded weights: dqn_pendulum_5actions_weights.h5
Episode 1: -115.92
Episode 2: -0.95
Episode 3: -121.01
Episode 4: -121.71
Episode 5: -120.03
Episode 6: -122.19
Episode 7: -120.91
Episode 8: -126.53
Episode 9: -366.49
Episode 10: -121.03
 5 actions: -133.68 ± 85.52 (+78.32 vs baseline, variance +13.60)

Evaluating 11 actions with epsilon=0.0
Loaded weights: dqn_pendulum_11actions_weights.h5
Episode 1: -122.67
Episode 2: -3.46
Episode 3: -223.48
Episode 4: -124.60
Episode 5: -374.11
Episode 6: -122.20
Episode 7: -429.12
Episode 8: -120.93
Episode 9: -121.08
Episode 10: -243.36
11 actions: -188.50 ± 123.59 (+23.50 vs baseline, variance -24.47)

Evaluating 21 actions with epsilon=0.0
Loaded weights: dqn_pendulum_21actions_weights.h5
Episode 1: -0.56
Episode 2: -126.49
Episode 3: -238.86
Episode 4: -119.29
Episode 5: -127.82
Episode 6: -362.82
Episode 7: -247.10
Episode 8: -251.83
Episode 9: -126.06
Episode 10: -127.83
21 actions: -172.86 ± 96.51 (+39.14 vs baseline, variance +2.61)

Evaluating 50 actions with epsilon=0.0
Loaded weights: dqn_pendulum_50actions_weights.h5
Episode 1: -247.63
Episode 2: -126.70
Episode 3: -131.06
Episode 4: -247.00
Episode 5: -261.46
Episode 6: -128.84
Episode 7: -1.50
Episode 8: -245.35
Episode 9: -230.53
Episode 10: -230.52
50 actions: -185.06 ± 80.33 (+26.94 vs baseline, variance +18.79)

Best configuration: 5 actions
Performance: -133.68 ± 85.52
Improvement over teacher: +78.32 points
No description has been provided for this image
Comparison plot saved as 'epsilon_zero_comparison.png'
In [9]:
def visualize_checkpoint(weights_path, n_actions, gif_path, max_steps=200, input_shape=3):
    ENV_NAME = "Pendulum-v0"
    
    # Create agent with Version 2 syntax
    agent = DQNAgent(
        input_shape=input_shape,
        n_actions=n_actions,
        gamma=0.99,
        replay_memory_size=50000,
        min_replay_memory=1000,
        batch_size=64,
        target_update_every=5,
        learning_rate=3e-4,
        epsilon_start=0.0,    # Greedy for visualization
        epsilon_min=0.0,
        epsilon_decay=1.0
    )
    
    if not os.path.exists(weights_path):
        print(f"Warning: {weights_path} not found!")
        return
        
    agent.load(weights_path)
    env = gym.make(ENV_NAME)
    s = env.reset()
    s = s if isinstance(s, np.ndarray) else s[0]
    
    frames = []
    total_reward = 0
    
    for t in range(max_steps):
        frame = env.render(mode='rgb_array')
        frames.append(frame)
        a_idx = agent.select_action(s)
        torque = action_index_to_torque(a_idx, n_actions)
        s_next, r, done, info = env.step(torque)
        s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
        s = s_next
        total_reward += r
        if done:
            break
    
    env.close()
    imageio.mimsave(gif_path, frames, fps=30)
    print(f"Saved GIF to {gif_path} (Total reward: {total_reward:.2f})")
In [25]:
# Run this after training to generate GIFs for all experiments
def generate_all_gifs():
    checkpoints = [50, 100, 150, 200]
    n_actions_list = [5, 11, 21, 50]
    input_shape = 3
    
    for n_actions in n_actions_list:
        print(f"\nGenerating GIFs for N_ACTIONS = {n_actions}")
        experiment_prefix = f"dqn_pendulum_{n_actions}actions"
        
        for ep in checkpoints:
            weights_path = f"{experiment_prefix}_{ep}_weights.h5"
            gif_path = f"{experiment_prefix}_{ep}_episode.gif"
            visualize_checkpoint(weights_path, n_actions, gif_path, input_shape=input_shape)
In [26]:
generate_all_gifs()
Generating GIFs for N_ACTIONS = 5
Saved GIF to dqn_pendulum_5actions_50_episode.gif (Total reward: -265.86)
Saved GIF to dqn_pendulum_5actions_100_episode.gif (Total reward: -766.59)
Saved GIF to dqn_pendulum_5actions_150_episode.gif (Total reward: -247.95)
Saved GIF to dqn_pendulum_5actions_200_episode.gif (Total reward: -0.85)

Generating GIFs for N_ACTIONS = 11
Saved GIF to dqn_pendulum_11actions_50_episode.gif (Total reward: -380.40)
Saved GIF to dqn_pendulum_11actions_100_episode.gif (Total reward: -1036.88)
Saved GIF to dqn_pendulum_11actions_150_episode.gif (Total reward: -244.15)
Saved GIF to dqn_pendulum_11actions_200_episode.gif (Total reward: -248.08)

Generating GIFs for N_ACTIONS = 21
Saved GIF to dqn_pendulum_21actions_50_episode.gif (Total reward: -1083.52)
Saved GIF to dqn_pendulum_21actions_100_episode.gif (Total reward: -1358.25)
Saved GIF to dqn_pendulum_21actions_150_episode.gif (Total reward: -1.29)
Saved GIF to dqn_pendulum_21actions_200_episode.gif (Total reward: -123.62)

Generating GIFs for N_ACTIONS = 50
Saved GIF to dqn_pendulum_50actions_50_episode.gif (Total reward: -1231.66)
Saved GIF to dqn_pendulum_50actions_100_episode.gif (Total reward: -1238.34)
Saved GIF to dqn_pendulum_50actions_150_episode.gif (Total reward: -567.67)
Saved GIF to dqn_pendulum_50actions_200_episode.gif (Total reward: -350.33)

Observations (epsilon = 0.0) ¶

  1. N_ACTIONS = 5: Best performer
  • Evaluation avg: -133.68 ± 85.52
  • Improvement over teacher baseline: +78.32
  • Most stable performance and clearest benefit from coarse discretization.

  1. N_ACTIONS = 21: Second best
  • Evaluation avg: -172.86 ± 96.51
  • Improvement over teacher baseline: +39.14
  • Stronger than 11 & 50 actions despite larger action space.

  1. N_ACTIONS = 50: Third
  • Evaluation avg: -185.06 ± 80.33
  • Improvement over baseline: +26.94
  • Surprisingly decent stability, but limited improvement for the cost.

  1. N_ACTIONS = 11: Worst performer
  • Evaluation avg: -188.50 ± 123.59
  • Improvement over baseline: +23.50
  • Very high variance, inconsistent learning across episodes.

Key Insights ¶

1. Coarse Discretization Advantage

  • N_ACTIONS = 5 clearly outperforms all others
  • Simpler action space allows for more effective exploration and learning
  • Fewer Q-values to learn = faster convergence and better final performance

2. Learning Stability

  • N_ACTIONS = 5: Standard deviation = ±85.52, reasonably stable.
  • N_ACTIONS = 11: Highest variance (±123.59) — learning is more erratic and less predictable.
  • Increasing n_actions leads to greater variability in performance due to over-fragmentation of action space.

3. Training Efficiency

  • All configurations trained for 200 episodes; however:
    • N_ACTIONS = 5 shows the best return for training time invested.
    • Higher action resolutions (21 & 50) incur more computational cost but don’t translate to better policy performance.

4. Exploration vs Exploitation Trade-off

  • From the training curves:
    • N_ACTIONS = 5: Smooth, steady improvement
    • N_ACTIONS = 50: More erratic learning, suggesting exploration challenges in high-dimensional action space

But the result above goes against what I learnt which was that: ¶

  • Lower N_ACTIONS: Higher variance due to coarser control

  • Higher N_ACTIONS: Lower variance due to finer control granularity

N_ACTIONS = 5 not only achieved the best average reward, but also exhibited relatively low variance. N_ACTIONS = 11 and 21, despite having finer control, suffered from higher variability and less consistent results.¶

let's try using diagnostic tests to figure out if there is an issue¶


Quick diagnostic tests:¶

What does it do ?

This test analyzes how our trained agents actually behave by examining which actions they choose in real scenarios, helping us understand if they learned good policies or just got lucky.

Why is this test important?¶

  1. Validates Performance Results
  • Good performance + good action usage = truly learned policy
  • Good performance + poor action usage = got lucky
  1. Reveals Learning Quality
  • N_ACTIONS = 5: Actually learned to control pendulum
  • N_ACTIONS = 50: Gave up learning, uses crude policy
  1. Explains Variance
  • Focused policies (low entropy) → consistent performance
  • Scattered policies (high entropy) → unpredictable performance
In [16]:
def analyze_action_usage(weights_path, n_actions, n_test_episodes=100, render=False):
    # Set up agent
    agent = DQNAgent(3, n_actions, 0.99, 50000, 1000, 64, 5, 3e-4, 0.0, 0.0, 1.0)
    agent.load(weights_path)
    agent.epsilon = 0.0  # Make sure it’s pure exploitation
    
    env = gym.make('Pendulum-v0')
    action_counts = np.zeros(n_actions)
    
    for ep in range(n_test_episodes):
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]

        for t in range(200):
            a_idx = agent.select_action(s)
            action_counts[a_idx] += 1
            torque = action_index_to_torque(a_idx, n_actions)
            s_next, r, done, _ = env.step(torque)
            s = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            if render:
                env.render()
            if done:
                break

    env.close()

    # Normalize and calculate entropy
    action_probs = action_counts / np.sum(action_counts)
    entropy = -np.sum(action_probs * np.log(action_probs + 1e-8))  # avoid log(0)

    return action_probs, entropy
In [15]:
# Run diagnostic for all models
for n_actions in [5, 11, 21, 50]:
    weights_path = f"dqn_pendulum_{n_actions}actions_weights.h5"
    if os.path.exists(weights_path):
        probs, entropy = analyze_action_usage(weights_path, n_actions)
        print(f"N_ACTIONS = {n_actions}")
        print(f"Action usage distribution: {np.round(probs, 5)}")
        print(f"Entropy (diversity): {entropy:.3f}")
        print("-" * 50)
    else:
        print(f"Weights not found for N={n_actions}")
N_ACTIONS = 5
Action usage distribution: [0.0812  0.3791  0.0089  0.35555 0.17525]
Entropy (diversity): 1.286
--------------------------------------------------
N_ACTIONS = 11
Action usage distribution: [0.01905 0.0483  0.00255 0.31255 0.0129  0.0074  0.22365 0.00685 0.00305
 0.32545 0.03825]
Entropy (diversity): 1.570
--------------------------------------------------
N_ACTIONS = 21
Action usage distribution: [0.0007  0.00105 0.0648  0.      0.21615 0.      0.      0.0087  0.
 0.      0.4201  0.00065 0.      0.0015  0.0014  0.00075 0.0069  0.00215
 0.2478  0.01245 0.0149 ]
Entropy (diversity): 1.466
--------------------------------------------------
N_ACTIONS = 50
Action usage distribution: [0.06765 0.0006  0.00165 0.00155 0.      0.00975 0.      0.0429  0.
 0.0059  0.      0.      0.56215 0.      0.      0.      0.      0.
 0.      0.      0.      0.      0.      0.      0.      0.      0.
 0.      0.      0.      0.      0.      0.      0.      0.      0.
 0.      0.      0.      0.      0.      0.0164  0.      0.003   0.24875
 0.      0.02135 0.00225 0.0161  0.     ]
Entropy (diversity): 1.335
--------------------------------------------------
N_ACTIONS Entropy Key Observation
5 1.286 Uses 4/5 actions well
11 1.570 Uses more actions, most diverse
21 1.466 Still diverse, ~5 dominant actions
50 1.335 Mostly collapsed to 5–6 actions

What is the actual meaning of entropy

  • measures the randomness or uncertainty of an agent's policy. It quantifies how much an agent is exploring versus exploiting its current knowledge.

  • High Entropy: A policy with high entropy means the agent assigns a more equal probability to all possible actions. The agent is uncertain about which action is best, so it explores a wider range of options. This is especially useful in the early stages of training to find new, potentially better rewards.

  • Low Entropy: A low-entropy policy means the agent is highly confident in its decision. It assigns a high probability to one or a few actions and a very low probability to others. This corresponds to the agent exploiting its knowledge by consistently choosing what it believes to be the best action. This is desirable once the agent has learned a good policy.

Observations

  1. N_ACTIONS = 5: Efficient Learner
  • Uses 4 out of 5 actions meaningfully.
  • Avoids zero torque (middle action), indicating it learned force is needed to move the pendulum.
  • Balanced use of positive and negative torques.
  • Entropy = 1.29 => focused but not too rigid.
  • Alighns with strong stable performance at epsilon = 0
  1. N_ACTIONS = 11: Most Diverse
  • Highest entropy (1.57).
  • Uses 9–10 actions non-trivially, suggesting good exploration and granularity.
  • Likely a sweet spot between expressiveness and learnability.
  • Though highest entropy with wide usage, it performed the worst which suggests diversity alone does not mean quality *(issue with limited training)*?
  1. N_ACTIONS = 21: Slight Collapse
  • ~5 actions used frequently, rest rarely or never used.
  • Entropy lower than N=11, suggesting slight over-complexity.
  • Agent still finds dominant actions but skips many fine-grained options.
  • Consistent with middling performance, likely due to increased difficulty in learning meaningful values for many similar actions.
  1. N_ACTIONS = 50: Clear Action Collapse
  • Over 56% usage in a single action (action 12).
  • Only 5–6 actions used with any meaningful frequency.
  • 40+ actions unused => agent likely couldn't learn their value.
  • Entropy = 1.33 => lower than N=11 despite larger space.
  • Suggests underfitting due to high action resolution and limited training episodes.
  • Consistent with middling performance, likely due to increased difficulty in learning meaningful values for many similar actions.

Hypothesis ¶

As the number of discrete actions (N_ACTIONS) increases, the agent requires more training to learn effectively, due to the increased size and granularity of the action space. Without enough experience, it may fail to explore or differentiate between similar actions, resulting in action collapse.

  • With N_ACTIONS = 5, the agent explores and uses nearly all actions meaningfully. The small action space is easier to learn from and requires less data.

  • At N_ACTIONS = 11, the agent achieves highest entropy, suggesting it can meaningfully differentiate and exploit a richer action set, just needs more training to fully utilise control?

  • With N_ACTIONS = 21 and especially 50, many actions go unused. The agent collapses to a few actions — not because they're best, but possibly because training was insufficient to distinguish subtle torque differences.

=> This supports the idea that more actions need more data, otherwise the agent may fall back to a bang-bang style of control using a few familiar actions.


Testing my hypothesis¶

We are going to rerun N_Actions = 5,11,21 (200eps), WHY?¶

Re-running N_ACTIONS=5 (200 episodes):

  • Scientific rigor: Ensures fair comparison with same random seed
  • Baseline verification: Confirms previous results weren't due to lucky randomization
  • Control group: Essential for validating that improvements in larger n_actions are due to more episodes, not other factors

Re-running N_ACTIONS=11 and 21 (200 episodes):

  • Direct comparison: I do need identical conditions to measure the effect of extended training
  • Before/after analysis: Compare 200ep vs 400ep/600ep for the same n_actions
In [35]:
def extended_training_experiment():
    """Test hypothesis: larger n_actions need more episodes"""
    
    # Configurations to test hypothesis
    configs = [
        {"n_actions": 5, "episodes": 200, "name": "5act_200ep_baseline"},
        {"n_actions": 11, "episodes": 200, "name": "11act_200ep_baseline"},  # Current performance
        {"n_actions": 11, "episodes": 400, "name": "11act_400ep_extended"},  # 2x training
        {"n_actions": 21, "episodes": 200, "name": "21act_200ep_baseline"},  # Current performance  
        {"n_actions": 21, "episodes": 600, "name": "21act_600ep_extended"},  # 3x training
    ]
    
    results = {}
    
    for config in configs:
        print("="*80)
        print(f"Running: {config['name']}")
        print("="*80)
        
        # Set seeds for fair comparison
        SEED = 42
        random.seed(SEED)
        np.random.seed(SEED)
        tf.random.set_seed(SEED)
        
        result = train_and_evaluate_extended(
            n_actions=config["n_actions"],
            n_episodes=config["episodes"], 
            experiment_prefix=config["name"]
        )
        
        results[config["name"]] = result
        
        # Save intermediate results
        with open("extended_training_results.json", "w") as f:
            json.dump(results, f, indent=2)
    
    # Analyze results
    analyze_extended_training_results(results)
    return results
In [36]:
def train_and_evaluate_extended(n_actions, n_episodes, experiment_prefix, RENDER_EVERY=50):
    """Modified version with proper action tracking"""
    
    ENV_NAME = 'Pendulum-v0'
    INPUT_SHAPE = 3
    GAMMA = 0.99
    REPLAY_MEMORY_SIZE = 50000
    MIN_REPLAY_MEMORY = 1000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    MAX_STEPS = 200

    # File paths
    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"
    TRAIN_PLOT_PATH = f"{experiment_prefix}_training_plot.png"

    env = gym.make(ENV_NAME)
    agent = DQNAgent(INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, 
                    MIN_REPLAY_MEMORY, BATCH_SIZE, TARGET_UPDATE_EVERY, 
                    LEARNING_RATE, EPSILON_START, EPSILON_MIN, EPSILON_DECAY)

    print(f"\nModel Summary:")
    agent.summary()
    
    # Tracking variables
    scores = []
    epsilons = []
    episode_times = []
    all_episode_actions = []  # PROPER action storage
    action_usage_per_100ep = []
    best_avg_reward = -np.inf
    start_time = time.time()

    for ep in range(1, n_episodes + 1):
        ep_start = time.time()
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]
        total_reward = 0
        episode_actions = []

        for t in range(MAX_STEPS):
            a_idx = agent.select_action(s)
            episode_actions.append(a_idx)
            torque = action_index_to_torque(a_idx, n_actions)
            s_next, r, done, info = env.step(torque)
            s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            agent.remember(s, a_idx, r, s_next, done)
            agent.train_step()
            s = s_next
            total_reward += r
            if done:
                break

        agent.decay_epsilon()
        if ep % TARGET_UPDATE_EVERY == 0:
            agent.update_target()

        # Save checkpoints
        if ep in [100, 200, 300, 400, 500, 600]:
            agent.save(f"{experiment_prefix}_{ep}_weights.h5")
        
        scores.append(total_reward)
        epsilons.append(agent.epsilon)
        ep_time = time.time() - ep_start
        episode_times.append(ep_time)
        all_episode_actions.append(episode_actions)  # STORE episode actions
        
        # PROPER action usage tracking every 100 episodes
        if ep % 100 == 0:
            action_counts = np.zeros(n_actions)
            # Analyze last 100 episodes of actions
            start_ep = max(0, len(all_episode_actions) - 100)
            for ep_actions in all_episode_actions[start_ep:]:
                for action in ep_actions:
                    action_counts[action] += 1
            
            if np.sum(action_counts) > 0:  # Avoid division by zero
                action_probs = action_counts / np.sum(action_counts)
                action_usage_per_100ep.append(action_probs)
                
                # Print action usage analysis
                entropy = -np.sum(action_probs * np.log(action_probs + 1e-8))
                print(f"\n--- Episode {ep}: Action Usage Analysis ---")
                print(f"Action distribution: {action_probs}")
                print(f"Entropy (diversity): {entropy:.3f}")
                print("-" * 50)

        avg_reward = np.mean(scores[-10:])
        
        # Full progress display
        print(f"Episode {ep} | Total Reward: {total_reward:.2f} | "
              f"Avg(10): {avg_reward:.2f} | Epsilon: {agent.epsilon:.3f} | "
              f"Time: {ep_time:.2f}s")

        if avg_reward > best_avg_reward:
            best_avg_reward = avg_reward
            agent.save(SAVE_WEIGHTS_PATH)

    env.close()
    total_time = time.time() - start_time

    # Enhanced plotting
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
    
    # Training progress
    ax1.plot(scores, alpha=0.7, label='Episode Reward')
    ax1.plot([np.mean(scores[max(0, i-9):i+1]) for i in range(len(scores))], 
             label='Moving Avg (10)', linewidth=2)
    ax1.set_xlabel('Episode')
    ax1.set_ylabel('Reward')
    ax1.set_title(f'Training Progress ({n_actions} actions, {n_episodes} episodes)')
    ax1.legend()
    
    # Epsilon decay
    ax2.plot(epsilons, color='red')
    ax2.set_xlabel('Episode')
    ax2.set_ylabel('Epsilon')
    ax2.set_title('Exploration Rate Over Time')
    
    #  learning progress/improvement analysis
    if len(scores) >= 100:
        learning_windows = []
        for i in range(100, len(scores), 50):
            improvement = np.mean(scores[i-50:i]) - np.mean(scores[i-100:i-50])
            learning_windows.append(improvement)
        ax3.plot(range(100, len(scores), 50), learning_windows)
        ax3.set_xlabel('Episode')
        ax3.set_ylabel('Improvement (50ep window)')
        ax3.set_title(' learning progress/improvements Over Time')
        ax3.axhline(y=0, color='red', linestyle='--', alpha=0.5)
    
    # Final performance distribution (if we have enough episodes)
    recent_scores = scores[-50:] if len(scores) >= 50 else scores
    ax4.hist(recent_scores, bins=15, alpha=0.7)
    ax4.set_xlabel('Reward')
    ax4.set_ylabel('Frequency')
    ax4.set_title(f'Recent Performance Distribution\nMean: {np.mean(recent_scores):.1f}±{np.std(recent_scores):.1f}')
    
    plt.tight_layout()
    plt.savefig(TRAIN_PLOT_PATH, dpi=300, bbox_inches='tight')
    plt.close()

    # Evaluation phase
    print(f"\nEvaluating trained model...")
    agent.load(SAVE_WEIGHTS_PATH)
    eval_rewards = []
    
    for ep in range(10):
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]
        total_reward = 0
        for t in range(MAX_STEPS):
            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, n_actions)
            s_next, r, done, info = env.step(torque)
            s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            total_reward += r
            s = s_next
            if done:
                break
        eval_rewards.append(total_reward)
        print(f"Test Episode {ep+1}: Total Reward = {total_reward:.2f}")
    
    env.close()
    
    # Print final summary like your original
    print(f"\nAverage Reward over 10 episodes: {np.mean(eval_rewards):.2f} ± {np.std(eval_rewards):.2f}")
    print(f"Best average reward over 10 episodes: {best_avg_reward:.2f}")
    print("Best model weights saved to:", SAVE_WEIGHTS_PATH)
    print(f"Total training time: {total_time:.2f}s")
    
    # Return comprehensive results
    results = {
        "n_actions": n_actions,
        "n_episodes": n_episodes,
        "training_time": total_time,
        "best_training_avg": best_avg_reward,
        "eval_mean": np.mean(eval_rewards),
        "eval_std": np.std(eval_rewards),
        "final_epsilon": agent.epsilon,
        "training_scores": scores,
        "episode_times": episode_times,
        "convergence_episode": None  # You could detect when learning plateaus
    }
    
    print(f"\n{experiment_prefix} Results:")
    print(f"Training best avg: {best_avg_reward:.2f}")
    print(f"Evaluation: {np.mean(eval_rewards):.2f} ± {np.std(eval_rewards):.2f}")
    print(f"Training time: {total_time:.1f}s")
    
    return results
In [32]:
def analyze_extended_training_results(results):
    """Analyze if more episodes helped larger n_actions"""
    
    print("\n" + "="*80)
    print("EXTENDED TRAINING ANALYSIS")
    print("="*80)
    
    # Compare baseline vs extended for each n_actions
    comparisons = [
        ("11act_200ep_baseline", "11act_400ep_extended", "N_ACTIONS=11"),
        ("21act_200ep_baseline", "21act_600ep_extended", "N_ACTIONS=21")
    ]
    
    for baseline_key, extended_key, label in comparisons:
        if baseline_key in results and extended_key in results:
            baseline = results[baseline_key]
            extended = results[extended_key]
            
            print(f"\n{label}:")
            print(f"  200 episodes: {baseline['eval_mean']:.1f} ± {baseline['eval_std']:.1f}")
            print(f"  Extended:     {extended['eval_mean']:.1f} ± {extended['eval_std']:.1f}")
            
            improvement = extended['eval_mean'] - baseline['eval_mean']
            time_cost = extended['training_time'] / baseline['training_time']
            
            print(f"  Improvement: {improvement:+.1f} ({improvement/abs(baseline['eval_mean'])*100:+.1f}%)")
            print(f"  Time cost: {time_cost:.1f}x longer")
            
            # Simple conclusion
            if improvement > 50:  # Significant improvement threshold
                print(f"Extended training HELPS for {label}")
            else:
                print(f"Extended training doesn't help much for {label}")
    
    # Overall conclusion
    print(f"\n{'='*40}")
    print("CONCLUSION:")
    print("If extended training doesn't help significantly,")
    print("then your friend's epsilon exploration idea becomes relevant!")
    print("="*40)
In [30]:
if __name__ == "__main__":
    extended_training_experiment()
================================================================================
Running: 5act_200ep_baseline
================================================================================

Model Summary:

Model Summary:
Model: "dqn_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_30 (Dense)            multiple                  256       
                                                                 
 dense_31 (Dense)            multiple                  4160      
                                                                 
 dense_32 (Dense)            multiple                  325       
                                                                 
=================================================================
Total params: 4741 (18.52 KB)
Trainable params: 4741 (18.52 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Episode 1 | Total Reward: -989.16 | Avg(10): -989.16 | Epsilon: 0.995 | Time: 0.05s
Episode 2 | Total Reward: -1644.80 | Avg(10): -1316.98 | Epsilon: 0.990 | Time: 0.04s
Episode 3 | Total Reward: -1095.40 | Avg(10): -1243.12 | Epsilon: 0.985 | Time: 0.04s
Episode 4 | Total Reward: -1009.04 | Avg(10): -1184.60 | Epsilon: 0.980 | Time: 0.06s
Episode 5 | Total Reward: -1716.37 | Avg(10): -1290.95 | Epsilon: 0.975 | Time: 0.18s
Episode 6 | Total Reward: -957.91 | Avg(10): -1235.45 | Epsilon: 0.970 | Time: 7.14s
Episode 7 | Total Reward: -1083.73 | Avg(10): -1213.77 | Epsilon: 0.966 | Time: 7.09s
Episode 8 | Total Reward: -1237.72 | Avg(10): -1216.77 | Epsilon: 0.961 | Time: 7.23s
Episode 9 | Total Reward: -1004.25 | Avg(10): -1193.15 | Epsilon: 0.956 | Time: 7.95s
Episode 10 | Total Reward: -1590.19 | Avg(10): -1232.86 | Epsilon: 0.951 | Time: 7.38s
Episode 11 | Total Reward: -1624.80 | Avg(10): -1296.42 | Epsilon: 0.946 | Time: 7.11s
Episode 12 | Total Reward: -1393.32 | Avg(10): -1271.27 | Epsilon: 0.942 | Time: 7.09s
Episode 13 | Total Reward: -1376.08 | Avg(10): -1299.34 | Epsilon: 0.937 | Time: 8.44s
Episode 14 | Total Reward: -979.15 | Avg(10): -1296.35 | Epsilon: 0.932 | Time: 12.28s
Episode 15 | Total Reward: -1085.14 | Avg(10): -1233.23 | Epsilon: 0.928 | Time: 12.38s
Episode 16 | Total Reward: -884.42 | Avg(10): -1225.88 | Epsilon: 0.923 | Time: 12.39s
Episode 17 | Total Reward: -862.90 | Avg(10): -1203.80 | Epsilon: 0.918 | Time: 12.25s
Episode 18 | Total Reward: -1581.36 | Avg(10): -1238.16 | Epsilon: 0.914 | Time: 8.48s
Episode 19 | Total Reward: -888.39 | Avg(10): -1226.57 | Epsilon: 0.909 | Time: 9.01s
Episode 20 | Total Reward: -1119.30 | Avg(10): -1179.49 | Epsilon: 0.905 | Time: 8.44s
Episode 21 | Total Reward: -1087.63 | Avg(10): -1125.77 | Epsilon: 0.900 | Time: 8.29s
Episode 22 | Total Reward: -1200.76 | Avg(10): -1106.51 | Epsilon: 0.896 | Time: 8.40s
Episode 23 | Total Reward: -1444.47 | Avg(10): -1113.35 | Epsilon: 0.891 | Time: 8.13s
Episode 24 | Total Reward: -1063.90 | Avg(10): -1121.83 | Epsilon: 0.887 | Time: 7.73s
Episode 25 | Total Reward: -1442.30 | Avg(10): -1157.54 | Epsilon: 0.882 | Time: 7.66s
Episode 26 | Total Reward: -1486.38 | Avg(10): -1217.74 | Epsilon: 0.878 | Time: 7.42s
Episode 27 | Total Reward: -1224.68 | Avg(10): -1253.92 | Epsilon: 0.873 | Time: 7.56s
Episode 28 | Total Reward: -1055.77 | Avg(10): -1201.36 | Epsilon: 0.869 | Time: 7.69s
Episode 29 | Total Reward: -1192.85 | Avg(10): -1231.80 | Epsilon: 0.865 | Time: 7.23s
Episode 30 | Total Reward: -1252.51 | Avg(10): -1245.13 | Epsilon: 0.860 | Time: 7.62s
Episode 31 | Total Reward: -1144.44 | Avg(10): -1250.81 | Epsilon: 0.856 | Time: 7.71s
Episode 32 | Total Reward: -1657.10 | Avg(10): -1296.44 | Epsilon: 0.852 | Time: 7.60s
Episode 33 | Total Reward: -978.62 | Avg(10): -1249.86 | Epsilon: 0.848 | Time: 7.65s
Episode 34 | Total Reward: -1442.76 | Avg(10): -1287.74 | Epsilon: 0.843 | Time: 8.03s
Episode 35 | Total Reward: -1310.01 | Avg(10): -1274.51 | Epsilon: 0.839 | Time: 8.15s
Episode 36 | Total Reward: -861.10 | Avg(10): -1211.99 | Epsilon: 0.835 | Time: 7.83s
Episode 37 | Total Reward: -1006.90 | Avg(10): -1190.21 | Epsilon: 0.831 | Time: 7.80s
Episode 38 | Total Reward: -1094.90 | Avg(10): -1194.12 | Epsilon: 0.827 | Time: 7.87s
Episode 39 | Total Reward: -1360.90 | Avg(10): -1210.93 | Epsilon: 0.822 | Time: 7.82s
Episode 40 | Total Reward: -1056.01 | Avg(10): -1191.28 | Epsilon: 0.818 | Time: 8.12s
Episode 41 | Total Reward: -1032.54 | Avg(10): -1180.09 | Epsilon: 0.814 | Time: 7.71s
Episode 42 | Total Reward: -737.62 | Avg(10): -1088.14 | Epsilon: 0.810 | Time: 7.64s
Episode 43 | Total Reward: -1644.81 | Avg(10): -1154.76 | Epsilon: 0.806 | Time: 7.52s
Episode 44 | Total Reward: -1567.59 | Avg(10): -1167.24 | Epsilon: 0.802 | Time: 7.21s
Episode 45 | Total Reward: -1627.78 | Avg(10): -1199.02 | Epsilon: 0.798 | Time: 7.41s
Episode 46 | Total Reward: -1382.28 | Avg(10): -1251.13 | Epsilon: 0.794 | Time: 7.67s
Episode 47 | Total Reward: -1574.47 | Avg(10): -1307.89 | Epsilon: 0.790 | Time: 7.65s
Episode 48 | Total Reward: -1189.70 | Avg(10): -1317.37 | Epsilon: 0.786 | Time: 7.59s
Episode 49 | Total Reward: -1568.60 | Avg(10): -1338.14 | Epsilon: 0.782 | Time: 7.73s
Episode 50 | Total Reward: -1415.46 | Avg(10): -1374.08 | Epsilon: 0.778 | Time: 7.90s
Episode 51 | Total Reward: -1472.16 | Avg(10): -1418.05 | Epsilon: 0.774 | Time: 7.88s
Episode 52 | Total Reward: -898.64 | Avg(10): -1434.15 | Epsilon: 0.771 | Time: 7.88s
Episode 53 | Total Reward: -1084.02 | Avg(10): -1378.07 | Epsilon: 0.767 | Time: 7.78s
Episode 54 | Total Reward: -747.28 | Avg(10): -1296.04 | Epsilon: 0.763 | Time: 7.70s
Episode 55 | Total Reward: -967.53 | Avg(10): -1230.01 | Epsilon: 0.759 | Time: 7.74s
Episode 56 | Total Reward: -1376.80 | Avg(10): -1229.47 | Epsilon: 0.755 | Time: 7.68s
Episode 57 | Total Reward: -1084.73 | Avg(10): -1180.49 | Epsilon: 0.751 | Time: 7.44s
Episode 58 | Total Reward: -1417.29 | Avg(10): -1203.25 | Epsilon: 0.748 | Time: 7.43s
Episode 59 | Total Reward: -1202.40 | Avg(10): -1166.63 | Epsilon: 0.744 | Time: 7.55s
Episode 60 | Total Reward: -1346.79 | Avg(10): -1159.77 | Epsilon: 0.740 | Time: 7.45s
Episode 61 | Total Reward: -1089.11 | Avg(10): -1121.46 | Epsilon: 0.737 | Time: 7.48s
Episode 62 | Total Reward: -1123.06 | Avg(10): -1143.90 | Epsilon: 0.733 | Time: 8.21s
Episode 63 | Total Reward: -1202.43 | Avg(10): -1155.74 | Epsilon: 0.729 | Time: 7.93s
Episode 64 | Total Reward: -847.64 | Avg(10): -1165.78 | Epsilon: 0.726 | Time: 7.65s
Episode 65 | Total Reward: -909.94 | Avg(10): -1160.02 | Epsilon: 0.722 | Time: 7.77s
Episode 66 | Total Reward: -794.78 | Avg(10): -1101.82 | Epsilon: 0.718 | Time: 7.92s
Episode 67 | Total Reward: -833.43 | Avg(10): -1076.69 | Epsilon: 0.715 | Time: 7.81s
Episode 68 | Total Reward: -753.80 | Avg(10): -1010.34 | Epsilon: 0.711 | Time: 8.15s
Episode 69 | Total Reward: -789.91 | Avg(10): -969.09 | Epsilon: 0.708 | Time: 7.66s
Episode 70 | Total Reward: -887.58 | Avg(10): -923.17 | Epsilon: 0.704 | Time: 7.31s
Episode 71 | Total Reward: -1239.23 | Avg(10): -938.18 | Epsilon: 0.701 | Time: 7.19s
Episode 72 | Total Reward: -866.49 | Avg(10): -912.52 | Epsilon: 0.697 | Time: 8.06s
Episode 73 | Total Reward: -1022.75 | Avg(10): -894.56 | Epsilon: 0.694 | Time: 7.46s
Episode 74 | Total Reward: -814.03 | Avg(10): -891.19 | Epsilon: 0.690 | Time: 7.48s
Episode 75 | Total Reward: -873.11 | Avg(10): -887.51 | Epsilon: 0.687 | Time: 7.61s
Episode 76 | Total Reward: -1212.55 | Avg(10): -929.29 | Epsilon: 0.683 | Time: 7.44s
Episode 77 | Total Reward: -755.15 | Avg(10): -921.46 | Epsilon: 0.680 | Time: 7.37s
Episode 78 | Total Reward: -964.92 | Avg(10): -942.57 | Epsilon: 0.676 | Time: 7.51s
Episode 79 | Total Reward: -1061.57 | Avg(10): -969.74 | Epsilon: 0.673 | Time: 7.89s
Episode 80 | Total Reward: -787.10 | Avg(10): -959.69 | Epsilon: 0.670 | Time: 7.56s
Episode 81 | Total Reward: -776.53 | Avg(10): -913.42 | Epsilon: 0.666 | Time: 7.79s
Episode 82 | Total Reward: -866.40 | Avg(10): -913.41 | Epsilon: 0.663 | Time: 7.86s
Episode 83 | Total Reward: -908.89 | Avg(10): -902.02 | Epsilon: 0.660 | Time: 7.92s
Episode 84 | Total Reward: -1004.26 | Avg(10): -921.05 | Epsilon: 0.656 | Time: 7.68s
Episode 85 | Total Reward: -777.21 | Avg(10): -911.46 | Epsilon: 0.653 | Time: 7.70s
Episode 86 | Total Reward: -876.41 | Avg(10): -877.84 | Epsilon: 0.650 | Time: 7.61s
Episode 87 | Total Reward: -1005.08 | Avg(10): -902.84 | Epsilon: 0.647 | Time: 7.79s
Episode 88 | Total Reward: -861.64 | Avg(10): -892.51 | Epsilon: 0.643 | Time: 7.57s
Episode 89 | Total Reward: -867.87 | Avg(10): -873.14 | Epsilon: 0.640 | Time: 7.36s
Episode 90 | Total Reward: -860.92 | Avg(10): -880.52 | Epsilon: 0.637 | Time: 7.47s
Episode 91 | Total Reward: -894.03 | Avg(10): -892.27 | Epsilon: 0.634 | Time: 7.39s
Episode 92 | Total Reward: -668.44 | Avg(10): -872.47 | Epsilon: 0.631 | Time: 7.49s
Episode 93 | Total Reward: -981.19 | Avg(10): -879.71 | Epsilon: 0.627 | Time: 7.60s
Episode 94 | Total Reward: -873.89 | Avg(10): -866.67 | Epsilon: 0.624 | Time: 7.68s
Episode 95 | Total Reward: -821.02 | Avg(10): -871.05 | Epsilon: 0.621 | Time: 7.67s
Episode 96 | Total Reward: -784.48 | Avg(10): -861.86 | Epsilon: 0.618 | Time: 7.60s
Episode 97 | Total Reward: -985.41 | Avg(10): -859.89 | Epsilon: 0.615 | Time: 7.92s
Episode 98 | Total Reward: -929.97 | Avg(10): -866.72 | Epsilon: 0.612 | Time: 7.95s
Episode 99 | Total Reward: -628.97 | Avg(10): -842.83 | Epsilon: 0.609 | Time: 8.01s

--- Episode 100: Action Usage Analysis ---
Action distribution: [0.2443  0.1772  0.1658  0.18055 0.23215]
Entropy (diversity): 1.597
--------------------------------------------------
Episode 100 | Total Reward: -633.05 | Avg(10): -820.05 | Epsilon: 0.606 | Time: 7.76s
Episode 101 | Total Reward: -381.30 | Avg(10): -768.77 | Epsilon: 0.603 | Time: 7.61s
Episode 102 | Total Reward: -855.61 | Avg(10): -787.49 | Epsilon: 0.600 | Time: 7.93s
Episode 103 | Total Reward: -903.22 | Avg(10): -779.69 | Epsilon: 0.597 | Time: 7.92s
Episode 104 | Total Reward: -1018.75 | Avg(10): -794.18 | Epsilon: 0.594 | Time: 7.46s
Episode 105 | Total Reward: -873.39 | Avg(10): -799.41 | Epsilon: 0.591 | Time: 7.44s
Episode 106 | Total Reward: -1044.15 | Avg(10): -825.38 | Epsilon: 0.588 | Time: 7.38s
Episode 107 | Total Reward: -860.93 | Avg(10): -812.93 | Epsilon: 0.585 | Time: 7.41s
Episode 108 | Total Reward: -795.47 | Avg(10): -799.48 | Epsilon: 0.582 | Time: 7.64s
Episode 109 | Total Reward: -860.79 | Avg(10): -822.67 | Epsilon: 0.579 | Time: 7.51s
Episode 110 | Total Reward: -631.22 | Avg(10): -822.48 | Epsilon: 0.576 | Time: 7.97s
Episode 111 | Total Reward: -885.46 | Avg(10): -872.90 | Epsilon: 0.573 | Time: 8.04s
Episode 112 | Total Reward: -768.18 | Avg(10): -864.16 | Epsilon: 0.570 | Time: 8.14s
Episode 113 | Total Reward: -806.69 | Avg(10): -854.50 | Epsilon: 0.568 | Time: 8.14s
Episode 114 | Total Reward: -601.62 | Avg(10): -812.79 | Epsilon: 0.565 | Time: 8.15s
Episode 115 | Total Reward: -907.45 | Avg(10): -816.20 | Epsilon: 0.562 | Time: 8.23s
Episode 116 | Total Reward: -625.10 | Avg(10): -774.29 | Epsilon: 0.559 | Time: 7.86s
Episode 117 | Total Reward: -515.90 | Avg(10): -739.79 | Epsilon: 0.556 | Time: 8.30s
Episode 118 | Total Reward: -955.28 | Avg(10): -755.77 | Epsilon: 0.554 | Time: 8.26s
Episode 119 | Total Reward: -650.22 | Avg(10): -734.71 | Epsilon: 0.551 | Time: 8.18s
Episode 120 | Total Reward: -636.82 | Avg(10): -735.27 | Epsilon: 0.548 | Time: 7.93s
Episode 121 | Total Reward: -1109.44 | Avg(10): -757.67 | Epsilon: 0.545 | Time: 7.95s
Episode 122 | Total Reward: -896.38 | Avg(10): -770.49 | Epsilon: 0.543 | Time: 8.02s
Episode 123 | Total Reward: -599.39 | Avg(10): -749.76 | Epsilon: 0.540 | Time: 8.04s
Episode 124 | Total Reward: -408.81 | Avg(10): -730.48 | Epsilon: 0.537 | Time: 8.10s
Episode 125 | Total Reward: -741.76 | Avg(10): -713.91 | Epsilon: 0.534 | Time: 8.22s
Episode 126 | Total Reward: -625.93 | Avg(10): -713.99 | Epsilon: 0.532 | Time: 8.13s
Episode 127 | Total Reward: -260.25 | Avg(10): -688.43 | Epsilon: 0.529 | Time: 8.22s
Episode 128 | Total Reward: -994.26 | Avg(10): -692.33 | Epsilon: 0.526 | Time: 8.26s
Episode 129 | Total Reward: -505.37 | Avg(10): -677.84 | Epsilon: 0.524 | Time: 8.26s
Episode 130 | Total Reward: -615.67 | Avg(10): -675.73 | Epsilon: 0.521 | Time: 8.36s
Episode 131 | Total Reward: -635.31 | Avg(10): -628.31 | Epsilon: 0.519 | Time: 8.39s
Episode 132 | Total Reward: -715.75 | Avg(10): -610.25 | Epsilon: 0.516 | Time: 8.12s
Episode 133 | Total Reward: -406.16 | Avg(10): -590.93 | Epsilon: 0.513 | Time: 8.26s
Episode 134 | Total Reward: -646.18 | Avg(10): -614.66 | Epsilon: 0.511 | Time: 8.48s
Episode 135 | Total Reward: -822.11 | Avg(10): -622.70 | Epsilon: 0.508 | Time: 8.46s
Episode 136 | Total Reward: -756.20 | Avg(10): -635.73 | Epsilon: 0.506 | Time: 7.89s
Episode 137 | Total Reward: -429.85 | Avg(10): -652.69 | Epsilon: 0.503 | Time: 8.20s
Episode 138 | Total Reward: -628.15 | Avg(10): -616.07 | Epsilon: 0.501 | Time: 6.21s
Episode 139 | Total Reward: -512.34 | Avg(10): -616.77 | Epsilon: 0.498 | Time: 8.70s
Episode 140 | Total Reward: -511.76 | Avg(10): -606.38 | Epsilon: 0.496 | Time: 7.71s
Episode 141 | Total Reward: -1229.01 | Avg(10): -665.75 | Epsilon: 0.493 | Time: 8.23s
Episode 142 | Total Reward: -841.40 | Avg(10): -678.32 | Epsilon: 0.491 | Time: 8.85s
Episode 143 | Total Reward: -508.17 | Avg(10): -688.52 | Epsilon: 0.488 | Time: 10.44s
Episode 144 | Total Reward: -601.28 | Avg(10): -684.03 | Epsilon: 0.486 | Time: 13.20s
Episode 145 | Total Reward: -815.52 | Avg(10): -683.37 | Epsilon: 0.483 | Time: 13.43s
Episode 146 | Total Reward: -307.58 | Avg(10): -638.50 | Epsilon: 0.481 | Time: 13.06s
Episode 147 | Total Reward: -364.67 | Avg(10): -631.99 | Epsilon: 0.479 | Time: 13.23s
Episode 148 | Total Reward: -346.60 | Avg(10): -603.83 | Epsilon: 0.476 | Time: 12.86s
Episode 149 | Total Reward: -490.04 | Avg(10): -601.60 | Epsilon: 0.474 | Time: 9.39s
Episode 150 | Total Reward: -1018.61 | Avg(10): -652.29 | Epsilon: 0.471 | Time: 7.06s
Episode 151 | Total Reward: -121.62 | Avg(10): -541.55 | Epsilon: 0.469 | Time: 7.07s
Episode 152 | Total Reward: -489.29 | Avg(10): -506.34 | Epsilon: 0.467 | Time: 7.21s
Episode 153 | Total Reward: -910.70 | Avg(10): -546.59 | Epsilon: 0.464 | Time: 7.29s
Episode 154 | Total Reward: -373.05 | Avg(10): -523.77 | Epsilon: 0.462 | Time: 7.27s
Episode 155 | Total Reward: -510.56 | Avg(10): -493.27 | Epsilon: 0.460 | Time: 7.37s
Episode 156 | Total Reward: -478.21 | Avg(10): -510.33 | Epsilon: 0.458 | Time: 7.39s
Episode 157 | Total Reward: -258.34 | Avg(10): -499.70 | Epsilon: 0.455 | Time: 11.99s
Episode 158 | Total Reward: -380.37 | Avg(10): -503.08 | Epsilon: 0.453 | Time: 9.78s
Episode 159 | Total Reward: -512.68 | Avg(10): -505.34 | Epsilon: 0.451 | Time: 7.62s
Episode 160 | Total Reward: -253.64 | Avg(10): -428.85 | Epsilon: 0.448 | Time: 7.11s
Episode 161 | Total Reward: -746.98 | Avg(10): -491.38 | Epsilon: 0.446 | Time: 7.00s
Episode 162 | Total Reward: -611.23 | Avg(10): -503.58 | Epsilon: 0.444 | Time: 7.03s
Episode 163 | Total Reward: -491.31 | Avg(10): -461.64 | Epsilon: 0.442 | Time: 7.05s
Episode 164 | Total Reward: -130.50 | Avg(10): -437.38 | Epsilon: 0.440 | Time: 7.78s
Episode 165 | Total Reward: -231.98 | Avg(10): -409.52 | Epsilon: 0.437 | Time: 8.10s
Episode 166 | Total Reward: -258.24 | Avg(10): -387.53 | Epsilon: 0.435 | Time: 8.01s
Episode 167 | Total Reward: -503.96 | Avg(10): -412.09 | Epsilon: 0.433 | Time: 8.05s
Episode 168 | Total Reward: -129.13 | Avg(10): -386.96 | Epsilon: 0.431 | Time: 7.98s
Episode 169 | Total Reward: -498.42 | Avg(10): -385.54 | Epsilon: 0.429 | Time: 8.05s
Episode 170 | Total Reward: -253.59 | Avg(10): -385.53 | Epsilon: 0.427 | Time: 8.14s
Episode 171 | Total Reward: -253.79 | Avg(10): -336.22 | Epsilon: 0.424 | Time: 8.03s
Episode 172 | Total Reward: -381.60 | Avg(10): -313.25 | Epsilon: 0.422 | Time: 7.71s
Episode 173 | Total Reward: -251.36 | Avg(10): -289.26 | Epsilon: 0.420 | Time: 7.85s
Episode 174 | Total Reward: -248.40 | Avg(10): -301.05 | Epsilon: 0.418 | Time: 7.90s
Episode 175 | Total Reward: -377.90 | Avg(10): -315.64 | Epsilon: 0.416 | Time: 7.81s
Episode 176 | Total Reward: -252.70 | Avg(10): -315.09 | Epsilon: 0.414 | Time: 7.67s
Episode 177 | Total Reward: -380.15 | Avg(10): -302.71 | Epsilon: 0.412 | Time: 7.59s
Episode 178 | Total Reward: -255.67 | Avg(10): -315.36 | Epsilon: 0.410 | Time: 7.56s
Episode 179 | Total Reward: -382.77 | Avg(10): -303.79 | Epsilon: 0.408 | Time: 7.67s
Episode 180 | Total Reward: -493.58 | Avg(10): -327.79 | Epsilon: 0.406 | Time: 7.64s
Episode 181 | Total Reward: -623.62 | Avg(10): -364.78 | Epsilon: 0.404 | Time: 7.53s
Episode 182 | Total Reward: -373.01 | Avg(10): -363.92 | Epsilon: 0.402 | Time: 7.66s
Episode 183 | Total Reward: -376.75 | Avg(10): -376.46 | Epsilon: 0.400 | Time: 7.93s
Episode 184 | Total Reward: -495.27 | Avg(10): -401.14 | Epsilon: 0.398 | Time: 8.02s
Episode 185 | Total Reward: -624.65 | Avg(10): -425.82 | Epsilon: 0.396 | Time: 7.98s
Episode 186 | Total Reward: -884.71 | Avg(10): -489.02 | Epsilon: 0.394 | Time: 8.08s
Episode 187 | Total Reward: -250.66 | Avg(10): -476.07 | Epsilon: 0.392 | Time: 8.03s
Episode 188 | Total Reward: -247.49 | Avg(10): -475.25 | Epsilon: 0.390 | Time: 7.67s
Episode 189 | Total Reward: -375.36 | Avg(10): -474.51 | Epsilon: 0.388 | Time: 7.92s
Episode 190 | Total Reward: -253.23 | Avg(10): -450.47 | Epsilon: 0.386 | Time: 7.74s
Episode 191 | Total Reward: -121.52 | Avg(10): -400.26 | Epsilon: 0.384 | Time: 7.62s
Episode 192 | Total Reward: -858.90 | Avg(10): -448.85 | Epsilon: 0.382 | Time: 7.89s
Episode 193 | Total Reward: -370.16 | Avg(10): -448.20 | Epsilon: 0.380 | Time: 7.49s
Episode 194 | Total Reward: -374.32 | Avg(10): -436.10 | Epsilon: 0.378 | Time: 7.58s
Episode 195 | Total Reward: -414.11 | Avg(10): -415.05 | Epsilon: 0.376 | Time: 7.75s
Episode 196 | Total Reward: -361.78 | Avg(10): -362.75 | Epsilon: 0.374 | Time: 8.14s
Episode 197 | Total Reward: -253.98 | Avg(10): -363.09 | Epsilon: 0.373 | Time: 7.75s
Episode 198 | Total Reward: -383.66 | Avg(10): -376.70 | Epsilon: 0.371 | Time: 7.78s
Episode 199 | Total Reward: -501.08 | Avg(10): -389.27 | Epsilon: 0.369 | Time: 8.01s

--- Episode 200: Action Usage Analysis ---
Action distribution: [0.28185 0.15285 0.1517  0.15075 0.26285]
Entropy (diversity): 1.567
--------------------------------------------------
Episode 200 | Total Reward: -555.13 | Avg(10): -419.46 | Epsilon: 0.367 | Time: 8.12s

Evaluating trained model...
Test Episode 1: Total Reward = -249.84
Test Episode 2: Total Reward = -121.73
Test Episode 3: Total Reward = -237.21
Test Episode 4: Total Reward = -362.19
Test Episode 5: Total Reward = -377.29
Test Episode 6: Total Reward = -600.51
Test Episode 7: Total Reward = -247.18
Test Episode 8: Total Reward = -253.26
Test Episode 9: Total Reward = -373.24
Test Episode 10: Total Reward = -251.96

Average Reward over 10 episodes: -307.44 ± 122.67
Best average reward over 10 episodes: -289.26
Best model weights saved to: 5act_200ep_baseline_weights.h5
Total training time: 1573.81s

5act_200ep_baseline Results:
Training best avg: -289.26
Evaluation: -307.44 ± 122.67
Training time: 1573.8s
================================================================================
Running: 11act_200ep_baseline
================================================================================

Model Summary:

Model Summary:
Model: "dqn_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_36 (Dense)            multiple                  256       
                                                                 
 dense_37 (Dense)            multiple                  4160      
                                                                 
 dense_38 (Dense)            multiple                  715       
                                                                 
=================================================================
Total params: 5131 (20.04 KB)
Trainable params: 5131 (20.04 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Episode 1 | Total Reward: -1071.39 | Avg(10): -1071.39 | Epsilon: 0.995 | Time: 0.04s
Episode 2 | Total Reward: -1268.90 | Avg(10): -1170.15 | Epsilon: 0.990 | Time: 0.03s
Episode 3 | Total Reward: -1197.42 | Avg(10): -1179.24 | Epsilon: 0.985 | Time: 0.03s
Episode 4 | Total Reward: -1163.14 | Avg(10): -1175.21 | Epsilon: 0.980 | Time: 0.05s
Episode 5 | Total Reward: -1198.85 | Avg(10): -1179.94 | Epsilon: 0.975 | Time: 0.15s
Episode 6 | Total Reward: -1472.97 | Avg(10): -1228.78 | Epsilon: 0.970 | Time: 7.91s
Episode 7 | Total Reward: -875.16 | Avg(10): -1178.26 | Epsilon: 0.966 | Time: 7.78s
Episode 8 | Total Reward: -1711.82 | Avg(10): -1244.96 | Epsilon: 0.961 | Time: 7.74s
Episode 9 | Total Reward: -1332.53 | Avg(10): -1254.69 | Epsilon: 0.956 | Time: 7.75s
Episode 10 | Total Reward: -1486.20 | Avg(10): -1277.84 | Epsilon: 0.951 | Time: 7.66s
Episode 11 | Total Reward: -1160.98 | Avg(10): -1286.80 | Epsilon: 0.946 | Time: 7.55s
Episode 12 | Total Reward: -1253.12 | Avg(10): -1285.22 | Epsilon: 0.942 | Time: 7.53s
Episode 13 | Total Reward: -1179.82 | Avg(10): -1283.46 | Epsilon: 0.937 | Time: 7.62s
Episode 14 | Total Reward: -1502.85 | Avg(10): -1317.43 | Epsilon: 0.932 | Time: 7.64s
Episode 15 | Total Reward: -1069.88 | Avg(10): -1304.53 | Epsilon: 0.928 | Time: 7.49s
Episode 16 | Total Reward: -1250.26 | Avg(10): -1282.26 | Epsilon: 0.923 | Time: 7.85s
Episode 17 | Total Reward: -1622.24 | Avg(10): -1356.97 | Epsilon: 0.918 | Time: 7.69s
Episode 18 | Total Reward: -1176.57 | Avg(10): -1303.44 | Epsilon: 0.914 | Time: 7.74s
Episode 19 | Total Reward: -882.33 | Avg(10): -1258.42 | Epsilon: 0.909 | Time: 7.84s
Episode 20 | Total Reward: -1634.90 | Avg(10): -1273.29 | Epsilon: 0.905 | Time: 7.82s
Episode 21 | Total Reward: -1055.10 | Avg(10): -1262.71 | Epsilon: 0.900 | Time: 8.00s
Episode 22 | Total Reward: -1246.92 | Avg(10): -1262.09 | Epsilon: 0.896 | Time: 7.19s
Episode 23 | Total Reward: -1087.50 | Avg(10): -1252.85 | Epsilon: 0.891 | Time: 6.97s
Episode 24 | Total Reward: -1263.53 | Avg(10): -1228.92 | Epsilon: 0.887 | Time: 6.82s
Episode 25 | Total Reward: -1205.21 | Avg(10): -1242.45 | Epsilon: 0.882 | Time: 6.82s
Episode 26 | Total Reward: -1699.60 | Avg(10): -1287.39 | Epsilon: 0.878 | Time: 6.79s
Episode 27 | Total Reward: -1219.73 | Avg(10): -1247.14 | Epsilon: 0.873 | Time: 6.79s
Episode 28 | Total Reward: -751.20 | Avg(10): -1204.60 | Epsilon: 0.869 | Time: 6.73s
Episode 29 | Total Reward: -1491.76 | Avg(10): -1265.54 | Epsilon: 0.865 | Time: 6.78s
Episode 30 | Total Reward: -1733.70 | Avg(10): -1275.42 | Epsilon: 0.860 | Time: 6.78s
Episode 31 | Total Reward: -1638.34 | Avg(10): -1333.75 | Epsilon: 0.856 | Time: 6.78s
Episode 32 | Total Reward: -1747.32 | Avg(10): -1383.79 | Epsilon: 0.852 | Time: 6.71s
Episode 33 | Total Reward: -864.53 | Avg(10): -1361.49 | Epsilon: 0.848 | Time: 6.68s
Episode 34 | Total Reward: -1651.81 | Avg(10): -1400.32 | Epsilon: 0.843 | Time: 6.66s
Episode 35 | Total Reward: -1174.86 | Avg(10): -1397.29 | Epsilon: 0.839 | Time: 6.77s
Episode 36 | Total Reward: -1234.33 | Avg(10): -1350.76 | Epsilon: 0.835 | Time: 6.80s
Episode 37 | Total Reward: -1209.98 | Avg(10): -1349.78 | Epsilon: 0.831 | Time: 6.90s
Episode 38 | Total Reward: -1315.91 | Avg(10): -1406.25 | Epsilon: 0.827 | Time: 6.76s
Episode 39 | Total Reward: -1194.92 | Avg(10): -1376.57 | Epsilon: 0.822 | Time: 6.78s
Episode 40 | Total Reward: -806.92 | Avg(10): -1283.89 | Epsilon: 0.818 | Time: 6.87s
Episode 41 | Total Reward: -737.00 | Avg(10): -1193.76 | Epsilon: 0.814 | Time: 6.88s
Episode 42 | Total Reward: -746.94 | Avg(10): -1093.72 | Epsilon: 0.810 | Time: 6.88s
Episode 43 | Total Reward: -1300.34 | Avg(10): -1137.30 | Epsilon: 0.806 | Time: 6.82s
Episode 44 | Total Reward: -1316.17 | Avg(10): -1103.74 | Epsilon: 0.802 | Time: 6.80s
Episode 45 | Total Reward: -1203.65 | Avg(10): -1106.62 | Epsilon: 0.798 | Time: 6.90s
Episode 46 | Total Reward: -868.99 | Avg(10): -1070.08 | Epsilon: 0.794 | Time: 6.91s
Episode 47 | Total Reward: -1016.93 | Avg(10): -1050.78 | Epsilon: 0.790 | Time: 7.52s
Episode 48 | Total Reward: -1200.94 | Avg(10): -1039.28 | Epsilon: 0.786 | Time: 7.60s
Episode 49 | Total Reward: -861.88 | Avg(10): -1005.98 | Epsilon: 0.782 | Time: 7.69s
Episode 50 | Total Reward: -1283.46 | Avg(10): -1053.63 | Epsilon: 0.778 | Time: 7.68s
Episode 51 | Total Reward: -1217.61 | Avg(10): -1101.69 | Epsilon: 0.774 | Time: 7.04s
Episode 52 | Total Reward: -1084.50 | Avg(10): -1135.45 | Epsilon: 0.771 | Time: 6.92s
Episode 53 | Total Reward: -925.39 | Avg(10): -1097.95 | Epsilon: 0.767 | Time: 6.79s
Episode 54 | Total Reward: -1485.16 | Avg(10): -1114.85 | Epsilon: 0.763 | Time: 6.79s
Episode 55 | Total Reward: -1286.45 | Avg(10): -1123.13 | Epsilon: 0.759 | Time: 6.99s
Episode 56 | Total Reward: -916.62 | Avg(10): -1127.89 | Epsilon: 0.755 | Time: 6.84s
Episode 57 | Total Reward: -1091.65 | Avg(10): -1135.37 | Epsilon: 0.751 | Time: 6.96s
Episode 58 | Total Reward: -1065.47 | Avg(10): -1121.82 | Epsilon: 0.748 | Time: 6.92s
Episode 59 | Total Reward: -1030.89 | Avg(10): -1138.72 | Epsilon: 0.744 | Time: 6.93s
Episode 60 | Total Reward: -973.72 | Avg(10): -1107.74 | Epsilon: 0.740 | Time: 7.08s
Episode 61 | Total Reward: -862.81 | Avg(10): -1072.26 | Epsilon: 0.737 | Time: 6.86s
Episode 62 | Total Reward: -1100.71 | Avg(10): -1073.88 | Epsilon: 0.733 | Time: 6.94s
Episode 63 | Total Reward: -900.41 | Avg(10): -1071.39 | Epsilon: 0.729 | Time: 6.99s
Episode 64 | Total Reward: -866.27 | Avg(10): -1009.50 | Epsilon: 0.726 | Time: 6.94s
Episode 65 | Total Reward: -1099.45 | Avg(10): -990.80 | Epsilon: 0.722 | Time: 6.88s
Episode 66 | Total Reward: -896.70 | Avg(10): -988.81 | Epsilon: 0.718 | Time: 6.75s
Episode 67 | Total Reward: -1352.69 | Avg(10): -1014.91 | Epsilon: 0.715 | Time: 7.01s
Episode 68 | Total Reward: -1191.90 | Avg(10): -1027.55 | Epsilon: 0.711 | Time: 6.99s
Episode 69 | Total Reward: -1041.52 | Avg(10): -1028.62 | Epsilon: 0.708 | Time: 6.88s
Episode 70 | Total Reward: -902.51 | Avg(10): -1021.50 | Epsilon: 0.704 | Time: 6.80s
Episode 71 | Total Reward: -1092.36 | Avg(10): -1044.45 | Epsilon: 0.701 | Time: 7.09s
Episode 72 | Total Reward: -1034.16 | Avg(10): -1037.80 | Epsilon: 0.697 | Time: 6.87s
Episode 73 | Total Reward: -1052.86 | Avg(10): -1053.04 | Epsilon: 0.694 | Time: 6.90s
Episode 74 | Total Reward: -892.01 | Avg(10): -1055.62 | Epsilon: 0.690 | Time: 7.25s
Episode 75 | Total Reward: -896.34 | Avg(10): -1035.31 | Epsilon: 0.687 | Time: 6.94s
Episode 76 | Total Reward: -1037.08 | Avg(10): -1049.34 | Epsilon: 0.683 | Time: 7.69s
Episode 77 | Total Reward: -1034.49 | Avg(10): -1017.52 | Epsilon: 0.680 | Time: 7.13s
Episode 78 | Total Reward: -1053.62 | Avg(10): -1003.69 | Epsilon: 0.676 | Time: 6.88s
Episode 79 | Total Reward: -1012.51 | Avg(10): -1000.79 | Epsilon: 0.673 | Time: 98.61s
Episode 80 | Total Reward: -905.14 | Avg(10): -1001.06 | Epsilon: 0.670 | Time: 7.21s
Episode 81 | Total Reward: -1016.00 | Avg(10): -993.42 | Epsilon: 0.666 | Time: 7.57s
Episode 82 | Total Reward: -1139.58 | Avg(10): -1003.96 | Epsilon: 0.663 | Time: 7.09s
Episode 83 | Total Reward: -1029.91 | Avg(10): -1001.67 | Epsilon: 0.660 | Time: 8.62s
Episode 84 | Total Reward: -1415.71 | Avg(10): -1054.04 | Epsilon: 0.656 | Time: 6.96s
Episode 85 | Total Reward: -1134.02 | Avg(10): -1077.81 | Epsilon: 0.653 | Time: 7.06s
Episode 86 | Total Reward: -1047.69 | Avg(10): -1078.87 | Epsilon: 0.650 | Time: 7.00s
Episode 87 | Total Reward: -1180.73 | Avg(10): -1093.49 | Epsilon: 0.647 | Time: 7.06s
Episode 88 | Total Reward: -1136.87 | Avg(10): -1101.82 | Epsilon: 0.643 | Time: 6.98s
Episode 89 | Total Reward: -976.58 | Avg(10): -1098.22 | Epsilon: 0.640 | Time: 6.99s
Episode 90 | Total Reward: -1124.16 | Avg(10): -1120.13 | Epsilon: 0.637 | Time: 9.39s
Episode 91 | Total Reward: -1013.50 | Avg(10): -1119.88 | Epsilon: 0.634 | Time: 10.22s
Episode 92 | Total Reward: -1052.28 | Avg(10): -1111.15 | Epsilon: 0.631 | Time: 7.19s
Episode 93 | Total Reward: -1040.46 | Avg(10): -1112.20 | Epsilon: 0.627 | Time: 7.24s
Episode 94 | Total Reward: -1036.16 | Avg(10): -1074.25 | Epsilon: 0.624 | Time: 7.32s
Episode 95 | Total Reward: -902.32 | Avg(10): -1051.08 | Epsilon: 0.621 | Time: 7.33s
Episode 96 | Total Reward: -918.33 | Avg(10): -1038.14 | Epsilon: 0.618 | Time: 7.14s
Episode 97 | Total Reward: -906.62 | Avg(10): -1010.73 | Epsilon: 0.615 | Time: 7.15s
Episode 98 | Total Reward: -1030.94 | Avg(10): -1000.14 | Epsilon: 0.612 | Time: 6.86s
Episode 99 | Total Reward: -786.36 | Avg(10): -981.11 | Epsilon: 0.609 | Time: 7.11s

--- Episode 100: Action Usage Analysis ---
Action distribution: [0.12225 0.09905 0.07935 0.08685 0.0767  0.08925 0.0729  0.0807  0.0777
 0.1103  0.10495]
Entropy (diversity): 2.384
--------------------------------------------------
Episode 100 | Total Reward: -904.99 | Avg(10): -959.20 | Epsilon: 0.606 | Time: 7.05s
Episode 101 | Total Reward: -955.08 | Avg(10): -953.35 | Epsilon: 0.603 | Time: 6.91s
Episode 102 | Total Reward: -906.83 | Avg(10): -938.81 | Epsilon: 0.600 | Time: 6.97s
Episode 103 | Total Reward: -861.26 | Avg(10): -920.89 | Epsilon: 0.597 | Time: 6.88s
Episode 104 | Total Reward: -899.13 | Avg(10): -907.19 | Epsilon: 0.594 | Time: 6.92s
Episode 105 | Total Reward: -1032.80 | Avg(10): -920.23 | Epsilon: 0.591 | Time: 6.85s
Episode 106 | Total Reward: -770.23 | Avg(10): -905.42 | Epsilon: 0.588 | Time: 9.02s
Episode 107 | Total Reward: -919.34 | Avg(10): -906.69 | Epsilon: 0.585 | Time: 8.11s
Episode 108 | Total Reward: -1116.32 | Avg(10): -915.23 | Epsilon: 0.582 | Time: 7.72s
Episode 109 | Total Reward: -1113.10 | Avg(10): -947.91 | Epsilon: 0.579 | Time: 7.63s
Episode 110 | Total Reward: -999.08 | Avg(10): -957.32 | Epsilon: 0.576 | Time: 7.71s
Episode 111 | Total Reward: -936.89 | Avg(10): -955.50 | Epsilon: 0.573 | Time: 7.70s
Episode 112 | Total Reward: -881.04 | Avg(10): -952.92 | Epsilon: 0.570 | Time: 7.54s
Episode 113 | Total Reward: -1043.25 | Avg(10): -971.12 | Epsilon: 0.568 | Time: 7.44s
Episode 114 | Total Reward: -1005.02 | Avg(10): -981.71 | Epsilon: 0.565 | Time: 7.40s
Episode 115 | Total Reward: -895.66 | Avg(10): -967.99 | Epsilon: 0.562 | Time: 7.38s
Episode 116 | Total Reward: -1036.30 | Avg(10): -994.60 | Epsilon: 0.559 | Time: 7.89s
Episode 117 | Total Reward: -630.32 | Avg(10): -965.70 | Epsilon: 0.556 | Time: 7.63s
Episode 118 | Total Reward: -849.74 | Avg(10): -939.04 | Epsilon: 0.554 | Time: 9.75s
Episode 119 | Total Reward: -750.64 | Avg(10): -902.79 | Epsilon: 0.551 | Time: 7.07s
Episode 120 | Total Reward: -721.48 | Avg(10): -875.03 | Epsilon: 0.548 | Time: 6.99s
Episode 121 | Total Reward: -862.03 | Avg(10): -867.55 | Epsilon: 0.545 | Time: 7.02s
Episode 122 | Total Reward: -775.37 | Avg(10): -856.98 | Epsilon: 0.543 | Time: 7.20s
Episode 123 | Total Reward: -825.96 | Avg(10): -835.25 | Epsilon: 0.540 | Time: 7.24s
Episode 124 | Total Reward: -985.64 | Avg(10): -833.31 | Epsilon: 0.537 | Time: 7.36s
Episode 125 | Total Reward: -785.11 | Avg(10): -822.26 | Epsilon: 0.534 | Time: 7.24s
Episode 126 | Total Reward: -1025.12 | Avg(10): -821.14 | Epsilon: 0.532 | Time: 7.27s
Episode 127 | Total Reward: -744.57 | Avg(10): -832.57 | Epsilon: 0.529 | Time: 7.46s
Episode 128 | Total Reward: -1041.01 | Avg(10): -851.69 | Epsilon: 0.526 | Time: 7.26s
Episode 129 | Total Reward: -756.54 | Avg(10): -852.28 | Epsilon: 0.524 | Time: 7.16s
Episode 130 | Total Reward: -766.98 | Avg(10): -856.83 | Epsilon: 0.521 | Time: 7.12s
Episode 131 | Total Reward: -756.63 | Avg(10): -846.29 | Epsilon: 0.519 | Time: 7.04s
Episode 132 | Total Reward: -855.25 | Avg(10): -854.28 | Epsilon: 0.516 | Time: 7.20s
Episode 133 | Total Reward: -1163.75 | Avg(10): -888.06 | Epsilon: 0.513 | Time: 7.07s
Episode 134 | Total Reward: -1060.40 | Avg(10): -895.54 | Epsilon: 0.511 | Time: 8.21s
Episode 135 | Total Reward: -863.22 | Avg(10): -903.35 | Epsilon: 0.508 | Time: 7.93s
Episode 136 | Total Reward: -726.93 | Avg(10): -873.53 | Epsilon: 0.506 | Time: 7.72s
Episode 137 | Total Reward: -885.17 | Avg(10): -887.59 | Epsilon: 0.503 | Time: 7.33s
Episode 138 | Total Reward: -895.76 | Avg(10): -873.06 | Epsilon: 0.501 | Time: 7.98s
Episode 139 | Total Reward: -533.67 | Avg(10): -850.77 | Epsilon: 0.498 | Time: 8.04s
Episode 140 | Total Reward: -852.73 | Avg(10): -859.35 | Epsilon: 0.496 | Time: 8.06s
Episode 141 | Total Reward: -865.76 | Avg(10): -870.26 | Epsilon: 0.493 | Time: 7.97s
Episode 142 | Total Reward: -513.88 | Avg(10): -836.12 | Epsilon: 0.491 | Time: 7.74s
Episode 143 | Total Reward: -847.71 | Avg(10): -804.52 | Epsilon: 0.488 | Time: 7.81s
Episode 144 | Total Reward: -752.26 | Avg(10): -773.71 | Epsilon: 0.486 | Time: 8.08s
Episode 145 | Total Reward: -1075.06 | Avg(10): -794.89 | Epsilon: 0.483 | Time: 7.81s
Episode 146 | Total Reward: -496.29 | Avg(10): -771.83 | Epsilon: 0.481 | Time: 7.44s
Episode 147 | Total Reward: -907.60 | Avg(10): -774.07 | Epsilon: 0.479 | Time: 7.57s
Episode 148 | Total Reward: -468.28 | Avg(10): -731.32 | Epsilon: 0.476 | Time: 7.66s
Episode 149 | Total Reward: -898.12 | Avg(10): -767.77 | Epsilon: 0.474 | Time: 7.56s
Episode 150 | Total Reward: -765.09 | Avg(10): -759.00 | Epsilon: 0.471 | Time: 7.51s
Episode 151 | Total Reward: -744.94 | Avg(10): -746.92 | Epsilon: 0.469 | Time: 7.92s
Episode 152 | Total Reward: -122.86 | Avg(10): -707.82 | Epsilon: 0.467 | Time: 7.67s
Episode 153 | Total Reward: -482.97 | Avg(10): -671.35 | Epsilon: 0.464 | Time: 7.63s
Episode 154 | Total Reward: -602.89 | Avg(10): -656.41 | Epsilon: 0.462 | Time: 7.90s
Episode 155 | Total Reward: -749.48 | Avg(10): -623.85 | Epsilon: 0.460 | Time: 7.91s
Episode 156 | Total Reward: -626.06 | Avg(10): -636.83 | Epsilon: 0.458 | Time: 8.12s
Episode 157 | Total Reward: -742.42 | Avg(10): -620.31 | Epsilon: 0.455 | Time: 7.75s
Episode 158 | Total Reward: -254.36 | Avg(10): -598.92 | Epsilon: 0.453 | Time: 7.86s
Episode 159 | Total Reward: -255.01 | Avg(10): -534.61 | Epsilon: 0.451 | Time: 7.59s
Episode 160 | Total Reward: -374.33 | Avg(10): -495.53 | Epsilon: 0.448 | Time: 7.83s
Episode 161 | Total Reward: -128.66 | Avg(10): -433.91 | Epsilon: 0.446 | Time: 7.52s
Episode 162 | Total Reward: -752.36 | Avg(10): -496.85 | Epsilon: 0.444 | Time: 7.40s
Episode 163 | Total Reward: -187.52 | Avg(10): -467.31 | Epsilon: 0.442 | Time: 7.46s
Episode 164 | Total Reward: -515.58 | Avg(10): -458.58 | Epsilon: 0.440 | Time: 7.64s
Episode 165 | Total Reward: -359.12 | Avg(10): -419.54 | Epsilon: 0.437 | Time: 7.73s
Episode 166 | Total Reward: -870.56 | Avg(10): -443.99 | Epsilon: 0.435 | Time: 7.55s
Episode 167 | Total Reward: -499.71 | Avg(10): -419.72 | Epsilon: 0.433 | Time: 7.84s
Episode 168 | Total Reward: -125.39 | Avg(10): -406.82 | Epsilon: 0.431 | Time: 7.55s
Episode 169 | Total Reward: -382.33 | Avg(10): -419.56 | Epsilon: 0.429 | Time: 7.96s
Episode 170 | Total Reward: -253.56 | Avg(10): -407.48 | Epsilon: 0.427 | Time: 8.00s
Episode 171 | Total Reward: -251.98 | Avg(10): -419.81 | Epsilon: 0.424 | Time: 7.97s
Episode 172 | Total Reward: -376.28 | Avg(10): -382.20 | Epsilon: 0.422 | Time: 7.95s
Episode 173 | Total Reward: -378.81 | Avg(10): -401.33 | Epsilon: 0.420 | Time: 7.87s
Episode 174 | Total Reward: -134.37 | Avg(10): -363.21 | Epsilon: 0.418 | Time: 7.76s
Episode 175 | Total Reward: -849.13 | Avg(10): -412.21 | Epsilon: 0.416 | Time: 7.76s
Episode 176 | Total Reward: -717.63 | Avg(10): -396.92 | Epsilon: 0.414 | Time: 7.54s
Episode 177 | Total Reward: -126.70 | Avg(10): -359.62 | Epsilon: 0.412 | Time: 7.51s
Episode 178 | Total Reward: -733.85 | Avg(10): -420.46 | Epsilon: 0.410 | Time: 7.45s
Episode 179 | Total Reward: -127.88 | Avg(10): -395.02 | Epsilon: 0.408 | Time: 7.44s
Episode 180 | Total Reward: -383.44 | Avg(10): -408.01 | Epsilon: 0.406 | Time: 7.64s
Episode 181 | Total Reward: -500.16 | Avg(10): -432.82 | Epsilon: 0.404 | Time: 7.56s
Episode 182 | Total Reward: -131.11 | Avg(10): -408.31 | Epsilon: 0.402 | Time: 7.51s
Episode 183 | Total Reward: -653.63 | Avg(10): -435.79 | Epsilon: 0.400 | Time: 7.61s
Episode 184 | Total Reward: -130.15 | Avg(10): -435.37 | Epsilon: 0.398 | Time: 7.84s
Episode 185 | Total Reward: -618.07 | Avg(10): -412.26 | Epsilon: 0.396 | Time: 8.13s
Episode 186 | Total Reward: -256.33 | Avg(10): -366.13 | Epsilon: 0.394 | Time: 8.20s
Episode 187 | Total Reward: -126.54 | Avg(10): -366.12 | Epsilon: 0.392 | Time: 7.61s
Episode 188 | Total Reward: -278.12 | Avg(10): -320.54 | Epsilon: 0.390 | Time: 7.23s
Episode 189 | Total Reward: -597.29 | Avg(10): -367.48 | Epsilon: 0.388 | Time: 7.48s
Episode 190 | Total Reward: -376.05 | Avg(10): -366.74 | Epsilon: 0.386 | Time: 7.35s
Episode 191 | Total Reward: -376.96 | Avg(10): -354.42 | Epsilon: 0.384 | Time: 8.07s
Episode 192 | Total Reward: -253.65 | Avg(10): -366.68 | Epsilon: 0.382 | Time: 7.52s
Episode 193 | Total Reward: -624.77 | Avg(10): -363.79 | Epsilon: 0.380 | Time: 7.15s
Episode 194 | Total Reward: -488.69 | Avg(10): -399.65 | Epsilon: 0.378 | Time: 7.29s
Episode 195 | Total Reward: -126.33 | Avg(10): -350.47 | Epsilon: 0.376 | Time: 7.12s
Episode 196 | Total Reward: -616.22 | Avg(10): -386.46 | Epsilon: 0.374 | Time: 7.20s
Episode 197 | Total Reward: -252.12 | Avg(10): -399.02 | Epsilon: 0.373 | Time: 7.10s
Episode 198 | Total Reward: -358.40 | Avg(10): -407.05 | Epsilon: 0.371 | Time: 7.14s
Episode 199 | Total Reward: -242.21 | Avg(10): -371.54 | Epsilon: 0.369 | Time: 6.94s

--- Episode 200: Action Usage Analysis ---
Action distribution: [0.23245 0.0621  0.06345 0.07005 0.05615 0.05855 0.05075 0.0634  0.05755
 0.16745 0.1181 ]
Entropy (diversity): 2.243
--------------------------------------------------
Episode 200 | Total Reward: -523.44 | Avg(10): -386.28 | Epsilon: 0.367 | Time: 7.09s

Evaluating trained model...
Test Episode 1: Total Reward = -566.87
Test Episode 2: Total Reward = -131.63
Test Episode 3: Total Reward = -511.20
Test Episode 4: Total Reward = -362.71
Test Episode 5: Total Reward = -597.46
Test Episode 6: Total Reward = -126.79
Test Episode 7: Total Reward = -501.92
Test Episode 8: Total Reward = -618.27
Test Episode 9: Total Reward = -558.67
Test Episode 10: Total Reward = -375.98

Average Reward over 10 episodes: -435.15 ± 172.83
Best average reward over 10 episodes: -320.54
Best model weights saved to: 11act_200ep_baseline_weights.h5
Total training time: 1535.59s

11act_200ep_baseline Results:
Training best avg: -320.54
Evaluation: -435.15 ± 172.83
Training time: 1535.6s
================================================================================
Running: 11act_400ep_extended
================================================================================

Model Summary:

Model Summary:
Model: "dqn_14"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_42 (Dense)            multiple                  256       
                                                                 
 dense_43 (Dense)            multiple                  4160      
                                                                 
 dense_44 (Dense)            multiple                  715       
                                                                 
=================================================================
Total params: 5131 (20.04 KB)
Trainable params: 5131 (20.04 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Episode 1 | Total Reward: -1070.87 | Avg(10): -1070.87 | Epsilon: 0.995 | Time: 0.04s
Episode 2 | Total Reward: -897.37 | Avg(10): -984.12 | Epsilon: 0.990 | Time: 0.04s
Episode 3 | Total Reward: -1449.50 | Avg(10): -1139.25 | Epsilon: 0.985 | Time: 0.04s
Episode 4 | Total Reward: -1584.02 | Avg(10): -1250.44 | Epsilon: 0.980 | Time: 0.04s
Episode 5 | Total Reward: -1163.14 | Avg(10): -1232.98 | Epsilon: 0.975 | Time: 0.14s
Episode 6 | Total Reward: -1312.23 | Avg(10): -1246.19 | Epsilon: 0.970 | Time: 6.75s
Episode 7 | Total Reward: -1138.21 | Avg(10): -1230.76 | Epsilon: 0.966 | Time: 6.75s
Episode 8 | Total Reward: -913.25 | Avg(10): -1191.07 | Epsilon: 0.961 | Time: 7.12s
Episode 9 | Total Reward: -959.87 | Avg(10): -1165.38 | Epsilon: 0.956 | Time: 7.11s
Episode 10 | Total Reward: -1530.76 | Avg(10): -1201.92 | Epsilon: 0.951 | Time: 6.94s
Episode 11 | Total Reward: -968.28 | Avg(10): -1191.66 | Epsilon: 0.946 | Time: 6.86s
Episode 12 | Total Reward: -972.53 | Avg(10): -1199.18 | Epsilon: 0.942 | Time: 6.82s
Episode 13 | Total Reward: -1501.05 | Avg(10): -1204.33 | Epsilon: 0.937 | Time: 6.74s
Episode 14 | Total Reward: -1065.64 | Avg(10): -1152.50 | Epsilon: 0.932 | Time: 6.76s
Episode 15 | Total Reward: -1638.46 | Avg(10): -1200.03 | Epsilon: 0.928 | Time: 6.74s
Episode 16 | Total Reward: -1078.52 | Avg(10): -1176.66 | Epsilon: 0.923 | Time: 6.75s
Episode 17 | Total Reward: -1231.01 | Avg(10): -1185.94 | Epsilon: 0.918 | Time: 6.70s
Episode 18 | Total Reward: -1651.87 | Avg(10): -1259.80 | Epsilon: 0.914 | Time: 6.69s
Episode 19 | Total Reward: -1502.63 | Avg(10): -1314.07 | Epsilon: 0.909 | Time: 6.70s
Episode 20 | Total Reward: -1518.85 | Avg(10): -1312.88 | Epsilon: 0.905 | Time: 6.67s
Episode 21 | Total Reward: -844.09 | Avg(10): -1300.46 | Epsilon: 0.900 | Time: 6.71s
Episode 22 | Total Reward: -1246.74 | Avg(10): -1327.89 | Epsilon: 0.896 | Time: 6.76s
Episode 23 | Total Reward: -1189.44 | Avg(10): -1296.72 | Epsilon: 0.891 | Time: 6.83s
Episode 24 | Total Reward: -1183.28 | Avg(10): -1308.49 | Epsilon: 0.887 | Time: 6.71s
Episode 25 | Total Reward: -1719.96 | Avg(10): -1316.64 | Epsilon: 0.882 | Time: 6.79s
Episode 26 | Total Reward: -756.08 | Avg(10): -1284.40 | Epsilon: 0.878 | Time: 6.80s
Episode 27 | Total Reward: -1473.86 | Avg(10): -1308.68 | Epsilon: 0.873 | Time: 6.74s
Episode 28 | Total Reward: -1507.75 | Avg(10): -1294.27 | Epsilon: 0.869 | Time: 6.72s
Episode 29 | Total Reward: -1651.12 | Avg(10): -1309.12 | Epsilon: 0.865 | Time: 6.75s
Episode 30 | Total Reward: -968.59 | Avg(10): -1254.09 | Epsilon: 0.860 | Time: 6.81s
Episode 31 | Total Reward: -1378.01 | Avg(10): -1307.48 | Epsilon: 0.856 | Time: 6.76s
Episode 32 | Total Reward: -1533.15 | Avg(10): -1336.12 | Epsilon: 0.852 | Time: 6.74s
Episode 33 | Total Reward: -1790.85 | Avg(10): -1396.27 | Epsilon: 0.848 | Time: 6.79s
Episode 34 | Total Reward: -1701.48 | Avg(10): -1448.09 | Epsilon: 0.843 | Time: 6.73s
Episode 35 | Total Reward: -1357.63 | Avg(10): -1411.85 | Epsilon: 0.839 | Time: 7.29s
Episode 36 | Total Reward: -1090.30 | Avg(10): -1445.27 | Epsilon: 0.835 | Time: 6.90s
Episode 37 | Total Reward: -903.43 | Avg(10): -1388.23 | Epsilon: 0.831 | Time: 6.81s
Episode 38 | Total Reward: -1077.84 | Avg(10): -1345.24 | Epsilon: 0.827 | Time: 6.86s
Episode 39 | Total Reward: -939.84 | Avg(10): -1274.11 | Epsilon: 0.822 | Time: 6.87s
Episode 40 | Total Reward: -1655.58 | Avg(10): -1342.81 | Epsilon: 0.818 | Time: 6.88s
Episode 41 | Total Reward: -1554.28 | Avg(10): -1360.44 | Epsilon: 0.814 | Time: 6.95s
Episode 42 | Total Reward: -1553.47 | Avg(10): -1362.47 | Epsilon: 0.810 | Time: 6.87s
Episode 43 | Total Reward: -1191.41 | Avg(10): -1302.53 | Epsilon: 0.806 | Time: 6.91s
Episode 44 | Total Reward: -1416.01 | Avg(10): -1273.98 | Epsilon: 0.802 | Time: 7.18s
Episode 45 | Total Reward: -1203.92 | Avg(10): -1258.61 | Epsilon: 0.798 | Time: 7.26s
Episode 46 | Total Reward: -1237.79 | Avg(10): -1273.36 | Epsilon: 0.794 | Time: 6.94s
Episode 47 | Total Reward: -1233.31 | Avg(10): -1306.34 | Epsilon: 0.790 | Time: 7.07s
Episode 48 | Total Reward: -1473.19 | Avg(10): -1345.88 | Epsilon: 0.786 | Time: 6.97s
Episode 49 | Total Reward: -1093.13 | Avg(10): -1361.21 | Epsilon: 0.782 | Time: 6.91s
Episode 50 | Total Reward: -1275.62 | Avg(10): -1323.21 | Epsilon: 0.778 | Time: 6.92s
Episode 51 | Total Reward: -1247.04 | Avg(10): -1292.49 | Epsilon: 0.774 | Time: 6.84s
Episode 52 | Total Reward: -1514.49 | Avg(10): -1288.59 | Epsilon: 0.771 | Time: 7.64s
Episode 53 | Total Reward: -866.95 | Avg(10): -1256.15 | Epsilon: 0.767 | Time: 8.69s
Episode 54 | Total Reward: -1397.97 | Avg(10): -1254.34 | Epsilon: 0.763 | Time: 8.29s
Episode 55 | Total Reward: -963.91 | Avg(10): -1230.34 | Epsilon: 0.759 | Time: 8.11s
Episode 56 | Total Reward: -1194.15 | Avg(10): -1225.98 | Epsilon: 0.755 | Time: 8.12s
Episode 57 | Total Reward: -1304.12 | Avg(10): -1233.06 | Epsilon: 0.751 | Time: 8.18s
Episode 58 | Total Reward: -917.00 | Avg(10): -1177.44 | Epsilon: 0.748 | Time: 8.56s
Episode 59 | Total Reward: -1291.51 | Avg(10): -1197.28 | Epsilon: 0.744 | Time: 8.15s
Episode 60 | Total Reward: -630.99 | Avg(10): -1132.81 | Epsilon: 0.740 | Time: 8.14s
Episode 61 | Total Reward: -888.13 | Avg(10): -1096.92 | Epsilon: 0.737 | Time: 7.99s
Episode 62 | Total Reward: -849.65 | Avg(10): -1030.44 | Epsilon: 0.733 | Time: 8.12s
Episode 63 | Total Reward: -773.66 | Avg(10): -1021.11 | Epsilon: 0.729 | Time: 8.06s
Episode 64 | Total Reward: -740.26 | Avg(10): -955.34 | Epsilon: 0.726 | Time: 8.14s
Episode 65 | Total Reward: -1001.07 | Avg(10): -959.05 | Epsilon: 0.722 | Time: 8.02s
Episode 66 | Total Reward: -874.83 | Avg(10): -927.12 | Epsilon: 0.718 | Time: 8.13s
Episode 67 | Total Reward: -1057.76 | Avg(10): -902.49 | Epsilon: 0.715 | Time: 8.53s
Episode 68 | Total Reward: -805.40 | Avg(10): -891.32 | Epsilon: 0.711 | Time: 8.17s
Episode 69 | Total Reward: -1207.93 | Avg(10): -882.97 | Epsilon: 0.708 | Time: 8.28s
Episode 70 | Total Reward: -1089.86 | Avg(10): -928.85 | Epsilon: 0.704 | Time: 8.19s
Episode 71 | Total Reward: -1137.69 | Avg(10): -953.81 | Epsilon: 0.701 | Time: 8.23s
Episode 72 | Total Reward: -980.34 | Avg(10): -966.88 | Epsilon: 0.697 | Time: 8.28s
Episode 73 | Total Reward: -1006.88 | Avg(10): -990.20 | Epsilon: 0.694 | Time: 8.28s
Episode 74 | Total Reward: -887.76 | Avg(10): -1004.95 | Epsilon: 0.690 | Time: 8.32s
Episode 75 | Total Reward: -766.16 | Avg(10): -981.46 | Epsilon: 0.687 | Time: 8.15s
Episode 76 | Total Reward: -1024.45 | Avg(10): -996.42 | Epsilon: 0.683 | Time: 8.24s
Episode 77 | Total Reward: -905.67 | Avg(10): -981.21 | Epsilon: 0.680 | Time: 8.29s
Episode 78 | Total Reward: -1058.33 | Avg(10): -1006.51 | Epsilon: 0.676 | Time: 8.16s
Episode 79 | Total Reward: -1058.99 | Avg(10): -991.61 | Epsilon: 0.673 | Time: 8.08s
Episode 80 | Total Reward: -791.62 | Avg(10): -961.79 | Epsilon: 0.670 | Time: 8.29s
Episode 81 | Total Reward: -1190.56 | Avg(10): -967.08 | Epsilon: 0.666 | Time: 8.20s
Episode 82 | Total Reward: -931.39 | Avg(10): -962.18 | Epsilon: 0.663 | Time: 8.15s
Episode 83 | Total Reward: -1012.89 | Avg(10): -962.78 | Epsilon: 0.660 | Time: 8.56s
Episode 84 | Total Reward: -1097.21 | Avg(10): -983.73 | Epsilon: 0.656 | Time: 8.32s
Episode 85 | Total Reward: -1075.53 | Avg(10): -1014.66 | Epsilon: 0.653 | Time: 8.44s
Episode 86 | Total Reward: -1195.04 | Avg(10): -1031.72 | Epsilon: 0.650 | Time: 8.40s
Episode 87 | Total Reward: -1039.48 | Avg(10): -1045.10 | Epsilon: 0.647 | Time: 8.48s
Episode 88 | Total Reward: -1093.88 | Avg(10): -1048.66 | Epsilon: 0.643 | Time: 8.61s
Episode 89 | Total Reward: -878.44 | Avg(10): -1030.61 | Epsilon: 0.640 | Time: 8.62s
Episode 90 | Total Reward: -1143.95 | Avg(10): -1065.84 | Epsilon: 0.637 | Time: 8.55s
Episode 91 | Total Reward: -1084.29 | Avg(10): -1055.21 | Epsilon: 0.634 | Time: 8.57s
Episode 92 | Total Reward: -1055.07 | Avg(10): -1067.58 | Epsilon: 0.631 | Time: 8.51s
Episode 93 | Total Reward: -1031.55 | Avg(10): -1069.45 | Epsilon: 0.627 | Time: 8.37s
Episode 94 | Total Reward: -888.34 | Avg(10): -1048.56 | Epsilon: 0.624 | Time: 8.47s
Episode 95 | Total Reward: -1136.06 | Avg(10): -1054.61 | Epsilon: 0.621 | Time: 8.37s
Episode 96 | Total Reward: -1010.98 | Avg(10): -1036.21 | Epsilon: 0.618 | Time: 8.51s
Episode 97 | Total Reward: -882.07 | Avg(10): -1020.46 | Epsilon: 0.615 | Time: 8.37s
Episode 98 | Total Reward: -1245.08 | Avg(10): -1035.58 | Epsilon: 0.612 | Time: 8.34s
Episode 99 | Total Reward: -1276.84 | Avg(10): -1075.42 | Epsilon: 0.609 | Time: 8.44s

--- Episode 100: Action Usage Analysis ---
Action distribution: [0.1318  0.08035 0.08615 0.0816  0.07925 0.0902  0.07285 0.0822  0.0823
 0.0883  0.125  ]
Entropy (diversity): 2.379
--------------------------------------------------
Episode 100 | Total Reward: -1152.03 | Avg(10): -1076.23 | Epsilon: 0.606 | Time: 8.61s
Episode 101 | Total Reward: -1006.95 | Avg(10): -1068.50 | Epsilon: 0.603 | Time: 8.69s
Episode 102 | Total Reward: -870.12 | Avg(10): -1050.00 | Epsilon: 0.600 | Time: 8.70s
Episode 103 | Total Reward: -1057.71 | Avg(10): -1052.62 | Epsilon: 0.597 | Time: 9.58s
Episode 104 | Total Reward: -1015.88 | Avg(10): -1065.37 | Epsilon: 0.594 | Time: 9.21s
Episode 105 | Total Reward: -920.68 | Avg(10): -1043.83 | Epsilon: 0.591 | Time: 8.57s
Episode 106 | Total Reward: -1026.98 | Avg(10): -1045.43 | Epsilon: 0.588 | Time: 8.54s
Episode 107 | Total Reward: -911.33 | Avg(10): -1048.36 | Epsilon: 0.585 | Time: 8.55s
Episode 108 | Total Reward: -1052.51 | Avg(10): -1029.10 | Epsilon: 0.582 | Time: 8.72s
Episode 109 | Total Reward: -1141.49 | Avg(10): -1015.57 | Epsilon: 0.579 | Time: 8.77s
Episode 110 | Total Reward: -1036.23 | Avg(10): -1003.99 | Epsilon: 0.576 | Time: 8.46s
Episode 111 | Total Reward: -1184.65 | Avg(10): -1021.76 | Epsilon: 0.573 | Time: 8.35s
Episode 112 | Total Reward: -916.27 | Avg(10): -1026.37 | Epsilon: 0.570 | Time: 9.48s
Episode 113 | Total Reward: -1140.83 | Avg(10): -1034.68 | Epsilon: 0.568 | Time: 8.58s
Episode 114 | Total Reward: -999.24 | Avg(10): -1033.02 | Epsilon: 0.565 | Time: 8.63s
Episode 115 | Total Reward: -865.97 | Avg(10): -1027.55 | Epsilon: 0.562 | Time: 8.67s
Episode 116 | Total Reward: -1121.12 | Avg(10): -1036.96 | Epsilon: 0.559 | Time: 8.60s
Episode 117 | Total Reward: -1022.14 | Avg(10): -1048.04 | Epsilon: 0.556 | Time: 8.69s
Episode 118 | Total Reward: -997.58 | Avg(10): -1042.55 | Epsilon: 0.554 | Time: 8.59s
Episode 119 | Total Reward: -645.15 | Avg(10): -992.92 | Epsilon: 0.551 | Time: 8.49s
Episode 120 | Total Reward: -1149.20 | Avg(10): -1004.21 | Epsilon: 0.548 | Time: 8.63s
Episode 121 | Total Reward: -1113.02 | Avg(10): -997.05 | Epsilon: 0.545 | Time: 8.55s
Episode 122 | Total Reward: -230.17 | Avg(10): -928.44 | Epsilon: 0.543 | Time: 8.54s
Episode 123 | Total Reward: -1088.84 | Avg(10): -923.24 | Epsilon: 0.540 | Time: 8.55s
Episode 124 | Total Reward: -774.33 | Avg(10): -900.75 | Epsilon: 0.537 | Time: 8.67s
Episode 125 | Total Reward: -1115.72 | Avg(10): -925.73 | Epsilon: 0.534 | Time: 8.66s
Episode 126 | Total Reward: -1172.15 | Avg(10): -930.83 | Epsilon: 0.532 | Time: 8.44s
Episode 127 | Total Reward: -652.34 | Avg(10): -893.85 | Epsilon: 0.529 | Time: 8.49s
Episode 128 | Total Reward: -994.44 | Avg(10): -893.53 | Epsilon: 0.526 | Time: 8.62s
Episode 129 | Total Reward: -1037.78 | Avg(10): -932.80 | Epsilon: 0.524 | Time: 8.59s
Episode 130 | Total Reward: -997.56 | Avg(10): -917.63 | Epsilon: 0.521 | Time: 8.62s
Episode 131 | Total Reward: -973.36 | Avg(10): -903.67 | Epsilon: 0.519 | Time: 8.54s
Episode 132 | Total Reward: -884.57 | Avg(10): -969.11 | Epsilon: 0.516 | Time: 8.61s
Episode 133 | Total Reward: -873.55 | Avg(10): -947.58 | Epsilon: 0.513 | Time: 8.66s
Episode 134 | Total Reward: -1129.76 | Avg(10): -983.12 | Epsilon: 0.511 | Time: 8.42s
Episode 135 | Total Reward: -990.25 | Avg(10): -970.58 | Epsilon: 0.508 | Time: 8.43s
Episode 136 | Total Reward: -842.86 | Avg(10): -937.65 | Epsilon: 0.506 | Time: 8.65s
Episode 137 | Total Reward: -847.73 | Avg(10): -957.19 | Epsilon: 0.503 | Time: 8.53s
Episode 138 | Total Reward: -902.67 | Avg(10): -948.01 | Epsilon: 0.501 | Time: 8.36s
Episode 139 | Total Reward: -719.38 | Avg(10): -916.17 | Epsilon: 0.498 | Time: 8.04s
Episode 140 | Total Reward: -478.52 | Avg(10): -864.26 | Epsilon: 0.496 | Time: 7.16s
Episode 141 | Total Reward: -515.27 | Avg(10): -818.45 | Epsilon: 0.493 | Time: 7.12s
Episode 142 | Total Reward: -536.28 | Avg(10): -783.63 | Epsilon: 0.491 | Time: 7.11s
Episode 143 | Total Reward: -774.54 | Avg(10): -773.72 | Epsilon: 0.488 | Time: 7.20s
Episode 144 | Total Reward: -604.27 | Avg(10): -721.18 | Epsilon: 0.486 | Time: 7.06s
Episode 145 | Total Reward: -515.28 | Avg(10): -673.68 | Epsilon: 0.483 | Time: 7.41s
Episode 146 | Total Reward: -509.90 | Avg(10): -640.38 | Epsilon: 0.481 | Time: 7.26s
Episode 147 | Total Reward: -919.57 | Avg(10): -647.57 | Epsilon: 0.479 | Time: 7.37s
Episode 148 | Total Reward: -767.82 | Avg(10): -634.08 | Epsilon: 0.476 | Time: 7.21s
Episode 149 | Total Reward: -1110.23 | Avg(10): -673.17 | Epsilon: 0.474 | Time: 7.20s
Episode 150 | Total Reward: -826.14 | Avg(10): -707.93 | Epsilon: 0.471 | Time: 7.08s
Episode 151 | Total Reward: -902.33 | Avg(10): -746.64 | Epsilon: 0.469 | Time: 7.14s
Episode 152 | Total Reward: -733.80 | Avg(10): -766.39 | Epsilon: 0.467 | Time: 7.16s
Episode 153 | Total Reward: -806.10 | Avg(10): -769.54 | Epsilon: 0.464 | Time: 7.17s
Episode 154 | Total Reward: -888.76 | Avg(10): -797.99 | Epsilon: 0.462 | Time: 7.15s
Episode 155 | Total Reward: -866.78 | Avg(10): -833.14 | Epsilon: 0.460 | Time: 7.15s
Episode 156 | Total Reward: -651.82 | Avg(10): -847.34 | Epsilon: 0.458 | Time: 7.06s
Episode 157 | Total Reward: -1004.58 | Avg(10): -855.84 | Epsilon: 0.455 | Time: 7.09s
Episode 158 | Total Reward: -634.80 | Avg(10): -842.53 | Epsilon: 0.453 | Time: 7.13s
Episode 159 | Total Reward: -505.70 | Avg(10): -782.08 | Epsilon: 0.451 | Time: 7.49s
Episode 160 | Total Reward: -635.26 | Avg(10): -762.99 | Epsilon: 0.448 | Time: 7.23s
Episode 161 | Total Reward: -467.39 | Avg(10): -719.50 | Epsilon: 0.446 | Time: 7.13s
Episode 162 | Total Reward: -534.18 | Avg(10): -699.54 | Epsilon: 0.444 | Time: 7.23s
Episode 163 | Total Reward: -631.42 | Avg(10): -682.07 | Epsilon: 0.442 | Time: 7.71s
Episode 164 | Total Reward: -698.34 | Avg(10): -663.03 | Epsilon: 0.440 | Time: 155.83s
Episode 165 | Total Reward: -494.31 | Avg(10): -625.78 | Epsilon: 0.437 | Time: 7.45s
Episode 166 | Total Reward: -682.42 | Avg(10): -628.84 | Epsilon: 0.435 | Time: 8.03s
Episode 167 | Total Reward: -951.40 | Avg(10): -623.52 | Epsilon: 0.433 | Time: 7.66s
Episode 168 | Total Reward: -735.32 | Avg(10): -633.57 | Epsilon: 0.431 | Time: 11.39s
Episode 169 | Total Reward: -634.29 | Avg(10): -646.43 | Epsilon: 0.429 | Time: 7.53s
Episode 170 | Total Reward: -506.44 | Avg(10): -633.55 | Epsilon: 0.427 | Time: 7.93s
Episode 171 | Total Reward: -618.45 | Avg(10): -648.66 | Epsilon: 0.424 | Time: 7.31s
Episode 172 | Total Reward: -751.91 | Avg(10): -670.43 | Epsilon: 0.422 | Time: 7.36s
Episode 173 | Total Reward: -503.64 | Avg(10): -657.65 | Epsilon: 0.420 | Time: 7.44s
Episode 174 | Total Reward: -505.08 | Avg(10): -638.33 | Epsilon: 0.418 | Time: 11.73s
Episode 175 | Total Reward: -623.95 | Avg(10): -651.29 | Epsilon: 0.416 | Time: 8.27s
Episode 176 | Total Reward: -501.63 | Avg(10): -633.21 | Epsilon: 0.414 | Time: 7.79s
Episode 177 | Total Reward: -254.20 | Avg(10): -563.49 | Epsilon: 0.412 | Time: 8.51s
Episode 178 | Total Reward: -498.79 | Avg(10): -539.84 | Epsilon: 0.410 | Time: 7.68s
Episode 179 | Total Reward: -379.61 | Avg(10): -514.37 | Epsilon: 0.408 | Time: 7.75s
Episode 180 | Total Reward: -251.96 | Avg(10): -488.92 | Epsilon: 0.406 | Time: 7.65s
Episode 181 | Total Reward: -505.36 | Avg(10): -477.61 | Epsilon: 0.404 | Time: 7.51s
Episode 182 | Total Reward: -641.94 | Avg(10): -466.62 | Epsilon: 0.402 | Time: 8.03s
Episode 183 | Total Reward: -948.62 | Avg(10): -511.11 | Epsilon: 0.400 | Time: 7.59s
Episode 184 | Total Reward: -253.78 | Avg(10): -485.98 | Epsilon: 0.398 | Time: 7.60s
Episode 185 | Total Reward: -378.89 | Avg(10): -461.48 | Epsilon: 0.396 | Time: 7.61s
Episode 186 | Total Reward: -506.53 | Avg(10): -461.97 | Epsilon: 0.394 | Time: 7.63s
Episode 187 | Total Reward: -503.01 | Avg(10): -486.85 | Epsilon: 0.392 | Time: 7.69s
Episode 188 | Total Reward: -258.51 | Avg(10): -462.82 | Epsilon: 0.390 | Time: 7.68s
Episode 189 | Total Reward: -497.14 | Avg(10): -474.57 | Epsilon: 0.388 | Time: 7.75s
Episode 190 | Total Reward: -252.22 | Avg(10): -474.60 | Epsilon: 0.386 | Time: 7.76s
Episode 191 | Total Reward: -607.60 | Avg(10): -484.82 | Epsilon: 0.384 | Time: 7.71s
Episode 192 | Total Reward: -494.71 | Avg(10): -470.10 | Epsilon: 0.382 | Time: 7.75s
Episode 193 | Total Reward: -379.53 | Avg(10): -413.19 | Epsilon: 0.380 | Time: 7.57s
Episode 194 | Total Reward: -736.89 | Avg(10): -461.50 | Epsilon: 0.378 | Time: 7.66s
Episode 195 | Total Reward: -250.39 | Avg(10): -448.65 | Epsilon: 0.376 | Time: 7.61s
Episode 196 | Total Reward: -252.11 | Avg(10): -423.21 | Epsilon: 0.374 | Time: 9.86s
Episode 197 | Total Reward: -619.11 | Avg(10): -434.82 | Epsilon: 0.373 | Time: 7.69s
Episode 198 | Total Reward: -248.13 | Avg(10): -433.78 | Epsilon: 0.371 | Time: 7.75s
Episode 199 | Total Reward: -376.13 | Avg(10): -421.68 | Epsilon: 0.369 | Time: 7.49s

--- Episode 200: Action Usage Analysis ---
Action distribution: [0.1788  0.0802  0.0671  0.0505  0.0496  0.07025 0.05005 0.05485 0.0682
 0.2365  0.09395]
Entropy (diversity): 2.233
--------------------------------------------------
Episode 200 | Total Reward: -254.74 | Avg(10): -421.93 | Epsilon: 0.367 | Time: 7.64s
Episode 201 | Total Reward: -823.80 | Avg(10): -443.55 | Epsilon: 0.365 | Time: 7.53s
Episode 202 | Total Reward: -400.20 | Avg(10): -434.10 | Epsilon: 0.363 | Time: 7.02s
Episode 203 | Total Reward: -254.02 | Avg(10): -421.55 | Epsilon: 0.361 | Time: 85.64s
Episode 204 | Total Reward: -492.31 | Avg(10): -397.09 | Epsilon: 0.360 | Time: 7.29s
Episode 205 | Total Reward: -484.61 | Avg(10): -420.52 | Epsilon: 0.358 | Time: 8.03s
Episode 206 | Total Reward: -641.60 | Avg(10): -459.47 | Epsilon: 0.356 | Time: 7.71s
Episode 207 | Total Reward: -380.88 | Avg(10): -435.64 | Epsilon: 0.354 | Time: 7.41s
Episode 208 | Total Reward: -487.34 | Avg(10): -459.56 | Epsilon: 0.353 | Time: 7.09s
Episode 209 | Total Reward: -498.76 | Avg(10): -471.83 | Epsilon: 0.351 | Time: 7.21s
Episode 210 | Total Reward: -376.62 | Avg(10): -484.02 | Epsilon: 0.349 | Time: 7.23s
Episode 211 | Total Reward: -492.20 | Avg(10): -450.86 | Epsilon: 0.347 | Time: 7.29s
Episode 212 | Total Reward: -369.73 | Avg(10): -447.81 | Epsilon: 0.346 | Time: 7.25s
Episode 213 | Total Reward: -501.35 | Avg(10): -472.54 | Epsilon: 0.344 | Time: 7.18s
Episode 214 | Total Reward: -490.00 | Avg(10): -472.31 | Epsilon: 0.342 | Time: 7.12s
Episode 215 | Total Reward: -500.40 | Avg(10): -473.89 | Epsilon: 0.340 | Time: 7.09s
Episode 216 | Total Reward: -377.43 | Avg(10): -447.47 | Epsilon: 0.339 | Time: 7.05s
Episode 217 | Total Reward: -249.18 | Avg(10): -434.30 | Epsilon: 0.337 | Time: 7.16s
Episode 218 | Total Reward: -249.00 | Avg(10): -410.47 | Epsilon: 0.335 | Time: 7.16s
Episode 219 | Total Reward: -519.87 | Avg(10): -412.58 | Epsilon: 0.334 | Time: 7.20s
Episode 220 | Total Reward: -372.80 | Avg(10): -412.20 | Epsilon: 0.332 | Time: 7.24s
Episode 221 | Total Reward: -490.03 | Avg(10): -411.98 | Epsilon: 0.330 | Time: 7.13s
Episode 222 | Total Reward: -252.41 | Avg(10): -400.25 | Epsilon: 0.329 | Time: 7.14s
Episode 223 | Total Reward: -252.44 | Avg(10): -375.36 | Epsilon: 0.327 | Time: 7.18s
Episode 224 | Total Reward: -360.55 | Avg(10): -362.41 | Epsilon: 0.325 | Time: 7.17s
Episode 225 | Total Reward: -624.17 | Avg(10): -374.79 | Epsilon: 0.324 | Time: 7.27s
Episode 226 | Total Reward: -623.85 | Avg(10): -399.43 | Epsilon: 0.322 | Time: 7.25s
Episode 227 | Total Reward: -377.11 | Avg(10): -412.22 | Epsilon: 0.321 | Time: 7.21s
Episode 228 | Total Reward: -251.22 | Avg(10): -412.44 | Epsilon: 0.319 | Time: 7.34s
Episode 229 | Total Reward: -245.78 | Avg(10): -385.04 | Epsilon: 0.317 | Time: 7.26s
Episode 230 | Total Reward: -364.79 | Avg(10): -384.23 | Epsilon: 0.316 | Time: 7.36s
Episode 231 | Total Reward: -608.55 | Avg(10): -396.09 | Epsilon: 0.314 | Time: 7.31s
Episode 232 | Total Reward: -471.39 | Avg(10): -417.98 | Epsilon: 0.313 | Time: 6.13s
Episode 233 | Total Reward: -124.04 | Avg(10): -405.15 | Epsilon: 0.311 | Time: 7.31s
Episode 234 | Total Reward: -127.64 | Avg(10): -381.85 | Epsilon: 0.309 | Time: 78.77s
Episode 235 | Total Reward: -2.37 | Avg(10): -319.67 | Epsilon: 0.308 | Time: 6.54s
Episode 236 | Total Reward: -366.47 | Avg(10): -293.94 | Epsilon: 0.306 | Time: 7.83s
Episode 237 | Total Reward: -376.91 | Avg(10): -293.92 | Epsilon: 0.305 | Time: 7.68s
Episode 238 | Total Reward: -252.96 | Avg(10): -294.09 | Epsilon: 0.303 | Time: 7.51s
Episode 239 | Total Reward: -133.59 | Avg(10): -282.87 | Epsilon: 0.302 | Time: 7.38s
Episode 240 | Total Reward: -481.18 | Avg(10): -294.51 | Epsilon: 0.300 | Time: 7.35s
Episode 241 | Total Reward: -376.71 | Avg(10): -271.33 | Epsilon: 0.299 | Time: 7.32s
Episode 242 | Total Reward: -344.12 | Avg(10): -258.60 | Epsilon: 0.297 | Time: 7.41s
Episode 243 | Total Reward: -384.43 | Avg(10): -284.64 | Epsilon: 0.296 | Time: 9.78s
Episode 244 | Total Reward: -505.34 | Avg(10): -322.41 | Epsilon: 0.294 | Time: 9.69s
Episode 245 | Total Reward: -149.05 | Avg(10): -337.08 | Epsilon: 0.293 | Time: 13.06s
Episode 246 | Total Reward: -233.47 | Avg(10): -323.77 | Epsilon: 0.291 | Time: 11.66s
Episode 247 | Total Reward: -2.70 | Avg(10): -286.35 | Epsilon: 0.290 | Time: 7.77s
Episode 248 | Total Reward: -132.74 | Avg(10): -274.33 | Epsilon: 0.288 | Time: 11.02s
Episode 249 | Total Reward: -501.40 | Avg(10): -311.11 | Epsilon: 0.287 | Time: 7.62s
Episode 250 | Total Reward: -247.88 | Avg(10): -287.78 | Epsilon: 0.286 | Time: 7.55s
Episode 251 | Total Reward: -379.13 | Avg(10): -288.02 | Epsilon: 0.284 | Time: 7.42s
Episode 252 | Total Reward: -278.55 | Avg(10): -281.47 | Epsilon: 0.283 | Time: 9.59s
Episode 253 | Total Reward: -366.33 | Avg(10): -279.66 | Epsilon: 0.281 | Time: 12.82s
Episode 254 | Total Reward: -263.08 | Avg(10): -255.43 | Epsilon: 0.280 | Time: 7.78s
Episode 255 | Total Reward: -252.73 | Avg(10): -265.80 | Epsilon: 0.279 | Time: 7.28s
Episode 256 | Total Reward: -367.86 | Avg(10): -279.24 | Epsilon: 0.277 | Time: 7.26s
Episode 257 | Total Reward: -122.66 | Avg(10): -291.24 | Epsilon: 0.276 | Time: 7.25s
Episode 258 | Total Reward: -377.64 | Avg(10): -315.73 | Epsilon: 0.274 | Time: 11.36s
Episode 259 | Total Reward: -370.74 | Avg(10): -302.66 | Epsilon: 0.273 | Time: 12.96s
Episode 260 | Total Reward: -369.94 | Avg(10): -314.87 | Epsilon: 0.272 | Time: 8.53s
Episode 261 | Total Reward: -369.38 | Avg(10): -313.89 | Epsilon: 0.270 | Time: 7.90s
Episode 262 | Total Reward: -375.98 | Avg(10): -323.63 | Epsilon: 0.269 | Time: 8.00s
Episode 263 | Total Reward: -371.59 | Avg(10): -324.16 | Epsilon: 0.268 | Time: 7.86s
Episode 264 | Total Reward: -251.66 | Avg(10): -323.02 | Epsilon: 0.266 | Time: 7.67s
Episode 265 | Total Reward: -377.33 | Avg(10): -335.48 | Epsilon: 0.265 | Time: 7.63s
Episode 266 | Total Reward: -247.39 | Avg(10): -323.43 | Epsilon: 0.264 | Time: 7.56s
Episode 267 | Total Reward: -2.37 | Avg(10): -311.40 | Epsilon: 0.262 | Time: 7.51s
Episode 268 | Total Reward: -125.84 | Avg(10): -286.22 | Epsilon: 0.261 | Time: 7.74s
Episode 269 | Total Reward: -377.10 | Avg(10): -286.86 | Epsilon: 0.260 | Time: 8.24s
Episode 270 | Total Reward: -357.79 | Avg(10): -285.64 | Epsilon: 0.258 | Time: 7.62s
Episode 271 | Total Reward: -234.76 | Avg(10): -272.18 | Epsilon: 0.257 | Time: 7.49s
Episode 272 | Total Reward: -248.84 | Avg(10): -259.47 | Epsilon: 0.256 | Time: 7.79s
Episode 273 | Total Reward: -122.17 | Avg(10): -234.52 | Epsilon: 0.255 | Time: 7.87s
Episode 274 | Total Reward: -493.30 | Avg(10): -258.69 | Epsilon: 0.253 | Time: 7.69s
Episode 275 | Total Reward: -242.18 | Avg(10): -245.17 | Epsilon: 0.252 | Time: 7.74s
Episode 276 | Total Reward: -231.09 | Avg(10): -243.54 | Epsilon: 0.251 | Time: 9.42s
Episode 277 | Total Reward: -124.29 | Avg(10): -255.74 | Epsilon: 0.249 | Time: 8.67s
Episode 278 | Total Reward: -497.32 | Avg(10): -292.88 | Epsilon: 0.248 | Time: 8.14s
Episode 279 | Total Reward: -2.60 | Avg(10): -255.43 | Epsilon: 0.247 | Time: 7.95s
Episode 280 | Total Reward: -244.74 | Avg(10): -244.13 | Epsilon: 0.246 | Time: 9.08s
Episode 281 | Total Reward: -490.13 | Avg(10): -269.67 | Epsilon: 0.245 | Time: 8.42s
Episode 282 | Total Reward: -235.53 | Avg(10): -268.34 | Epsilon: 0.243 | Time: 9.30s
Episode 283 | Total Reward: -1.99 | Avg(10): -256.32 | Epsilon: 0.242 | Time: 11.53s
Episode 284 | Total Reward: -128.12 | Avg(10): -219.80 | Epsilon: 0.241 | Time: 10.80s
Episode 285 | Total Reward: -358.53 | Avg(10): -231.43 | Epsilon: 0.240 | Time: 8.89s
Episode 286 | Total Reward: -397.41 | Avg(10): -248.07 | Epsilon: 0.238 | Time: 7.74s
Episode 287 | Total Reward: -257.07 | Avg(10): -261.34 | Epsilon: 0.237 | Time: 7.67s
Episode 288 | Total Reward: -1.49 | Avg(10): -211.76 | Epsilon: 0.236 | Time: 7.81s
Episode 289 | Total Reward: -126.32 | Avg(10): -224.13 | Epsilon: 0.235 | Time: 7.80s
Episode 290 | Total Reward: -249.83 | Avg(10): -224.64 | Epsilon: 0.234 | Time: 10.80s
Episode 291 | Total Reward: -125.82 | Avg(10): -188.21 | Epsilon: 0.233 | Time: 10.24s
Episode 292 | Total Reward: -242.96 | Avg(10): -188.95 | Epsilon: 0.231 | Time: 11.22s
Episode 293 | Total Reward: -377.15 | Avg(10): -226.47 | Epsilon: 0.230 | Time: 7.72s
Episode 294 | Total Reward: -253.63 | Avg(10): -239.02 | Epsilon: 0.229 | Time: 7.55s
Episode 295 | Total Reward: -122.92 | Avg(10): -215.46 | Epsilon: 0.228 | Time: 7.59s
Episode 296 | Total Reward: -125.36 | Avg(10): -188.26 | Epsilon: 0.227 | Time: 7.50s
Episode 297 | Total Reward: -470.68 | Avg(10): -209.62 | Epsilon: 0.226 | Time: 7.55s
Episode 298 | Total Reward: -4.82 | Avg(10): -209.95 | Epsilon: 0.225 | Time: 7.94s
Episode 299 | Total Reward: -126.20 | Avg(10): -209.94 | Epsilon: 0.223 | Time: 7.67s

--- Episode 300: Action Usage Analysis ---
Action distribution: [0.17545 0.0567  0.04965 0.0427  0.06135 0.06315 0.06345 0.06025 0.0673
 0.25345 0.10655]
Entropy (diversity): 2.210
--------------------------------------------------
Episode 300 | Total Reward: -126.01 | Avg(10): -197.56 | Epsilon: 0.222 | Time: 7.89s
Episode 301 | Total Reward: -120.23 | Avg(10): -197.00 | Epsilon: 0.221 | Time: 12.21s
Episode 302 | Total Reward: -121.32 | Avg(10): -184.83 | Epsilon: 0.220 | Time: 8.19s
Episode 303 | Total Reward: -245.11 | Avg(10): -171.63 | Epsilon: 0.219 | Time: 8.28s
Episode 304 | Total Reward: -247.20 | Avg(10): -170.99 | Epsilon: 0.218 | Time: 8.25s
Episode 305 | Total Reward: -202.07 | Avg(10): -178.90 | Epsilon: 0.217 | Time: 8.19s
Episode 306 | Total Reward: -1.58 | Avg(10): -166.52 | Epsilon: 0.216 | Time: 8.13s
Episode 307 | Total Reward: -123.78 | Avg(10): -131.83 | Epsilon: 0.215 | Time: 8.11s
Episode 308 | Total Reward: -124.12 | Avg(10): -143.76 | Epsilon: 0.214 | Time: 10.22s
Episode 309 | Total Reward: -283.58 | Avg(10): -159.50 | Epsilon: 0.212 | Time: 7.94s
Episode 310 | Total Reward: -241.25 | Avg(10): -171.02 | Epsilon: 0.211 | Time: 7.89s
Episode 311 | Total Reward: -250.41 | Avg(10): -184.04 | Epsilon: 0.210 | Time: 7.74s
Episode 312 | Total Reward: -340.56 | Avg(10): -205.97 | Epsilon: 0.209 | Time: 8.03s
Episode 313 | Total Reward: -121.78 | Avg(10): -193.63 | Epsilon: 0.208 | Time: 7.82s
Episode 314 | Total Reward: -148.83 | Avg(10): -183.80 | Epsilon: 0.207 | Time: 7.84s
Episode 315 | Total Reward: -126.22 | Avg(10): -176.21 | Epsilon: 0.206 | Time: 8.08s
Episode 316 | Total Reward: -123.28 | Avg(10): -188.38 | Epsilon: 0.205 | Time: 7.96s
Episode 317 | Total Reward: -369.77 | Avg(10): -212.98 | Epsilon: 0.204 | Time: 8.01s
Episode 318 | Total Reward: -122.33 | Avg(10): -212.80 | Epsilon: 0.203 | Time: 8.16s
Episode 319 | Total Reward: -243.94 | Avg(10): -208.84 | Epsilon: 0.202 | Time: 8.08s
Episode 320 | Total Reward: -1.27 | Avg(10): -184.84 | Epsilon: 0.201 | Time: 7.96s
Episode 321 | Total Reward: -126.01 | Avg(10): -172.40 | Epsilon: 0.200 | Time: 7.85s
Episode 322 | Total Reward: -122.74 | Avg(10): -150.62 | Epsilon: 0.199 | Time: 10.64s
Episode 323 | Total Reward: -527.98 | Avg(10): -191.24 | Epsilon: 0.198 | Time: 11.10s
Episode 324 | Total Reward: -125.85 | Avg(10): -188.94 | Epsilon: 0.197 | Time: 7.78s
Episode 325 | Total Reward: -123.96 | Avg(10): -188.71 | Epsilon: 0.196 | Time: 10.81s
Episode 326 | Total Reward: -355.91 | Avg(10): -211.98 | Epsilon: 0.195 | Time: 12.25s
Episode 327 | Total Reward: -115.00 | Avg(10): -186.50 | Epsilon: 0.194 | Time: 8.18s
Episode 328 | Total Reward: -243.00 | Avg(10): -198.57 | Epsilon: 0.193 | Time: 7.84s
Episode 329 | Total Reward: -561.85 | Avg(10): -230.36 | Epsilon: 0.192 | Time: 7.91s
Episode 330 | Total Reward: -123.30 | Avg(10): -242.56 | Epsilon: 0.191 | Time: 7.91s
Episode 331 | Total Reward: -124.10 | Avg(10): -242.37 | Epsilon: 0.190 | Time: 7.96s
Episode 332 | Total Reward: -123.63 | Avg(10): -242.46 | Epsilon: 0.189 | Time: 7.95s
Episode 333 | Total Reward: -366.85 | Avg(10): -226.35 | Epsilon: 0.188 | Time: 7.94s
Episode 334 | Total Reward: -259.55 | Avg(10): -239.72 | Epsilon: 0.187 | Time: 7.90s
Episode 335 | Total Reward: -118.29 | Avg(10): -239.15 | Epsilon: 0.187 | Time: 8.00s
Episode 336 | Total Reward: -124.09 | Avg(10): -215.97 | Epsilon: 0.186 | Time: 7.90s
Episode 337 | Total Reward: -121.16 | Avg(10): -216.58 | Epsilon: 0.185 | Time: 7.76s
Episode 338 | Total Reward: -124.37 | Avg(10): -204.72 | Epsilon: 0.184 | Time: 7.75s
Episode 339 | Total Reward: -389.02 | Avg(10): -187.44 | Epsilon: 0.183 | Time: 7.74s
Episode 340 | Total Reward: -246.46 | Avg(10): -199.75 | Epsilon: 0.182 | Time: 7.63s
Episode 341 | Total Reward: -244.08 | Avg(10): -211.75 | Epsilon: 0.181 | Time: 7.48s
Episode 342 | Total Reward: -126.07 | Avg(10): -211.99 | Epsilon: 0.180 | Time: 7.52s
Episode 343 | Total Reward: -123.39 | Avg(10): -187.65 | Epsilon: 0.179 | Time: 8.35s
Episode 344 | Total Reward: -243.03 | Avg(10): -186.00 | Epsilon: 0.178 | Time: 7.97s
Episode 345 | Total Reward: -117.50 | Avg(10): -185.92 | Epsilon: 0.177 | Time: 7.96s
Episode 346 | Total Reward: -124.89 | Avg(10): -186.00 | Epsilon: 0.177 | Time: 7.91s
Episode 347 | Total Reward: -237.07 | Avg(10): -197.59 | Epsilon: 0.176 | Time: 10.29s
Episode 348 | Total Reward: -360.57 | Avg(10): -221.21 | Epsilon: 0.175 | Time: 8.18s
Episode 349 | Total Reward: -362.72 | Avg(10): -218.58 | Epsilon: 0.174 | Time: 8.16s
Episode 350 | Total Reward: -124.98 | Avg(10): -206.43 | Epsilon: 0.173 | Time: 7.83s
Episode 351 | Total Reward: -239.78 | Avg(10): -206.00 | Epsilon: 0.172 | Time: 7.96s
Episode 352 | Total Reward: -1.89 | Avg(10): -193.58 | Epsilon: 0.171 | Time: 8.34s
Episode 353 | Total Reward: -1.92 | Avg(10): -181.43 | Epsilon: 0.170 | Time: 7.85s
Episode 354 | Total Reward: -124.36 | Avg(10): -169.57 | Epsilon: 0.170 | Time: 8.07s
Episode 355 | Total Reward: -122.23 | Avg(10): -170.04 | Epsilon: 0.169 | Time: 7.78s
Episode 356 | Total Reward: -124.16 | Avg(10): -169.97 | Epsilon: 0.168 | Time: 7.59s
Episode 357 | Total Reward: -241.25 | Avg(10): -170.39 | Epsilon: 0.167 | Time: 7.64s
Episode 358 | Total Reward: -124.33 | Avg(10): -146.76 | Epsilon: 0.166 | Time: 7.68s
Episode 359 | Total Reward: -2.47 | Avg(10): -110.74 | Epsilon: 0.165 | Time: 8.20s
Episode 360 | Total Reward: -238.31 | Avg(10): -122.07 | Epsilon: 0.165 | Time: 7.87s
Episode 361 | Total Reward: -243.81 | Avg(10): -122.47 | Epsilon: 0.164 | Time: 7.93s
Episode 362 | Total Reward: -255.57 | Avg(10): -147.84 | Epsilon: 0.163 | Time: 8.28s
Episode 363 | Total Reward: -246.41 | Avg(10): -172.29 | Epsilon: 0.162 | Time: 8.33s
Episode 364 | Total Reward: -234.97 | Avg(10): -183.35 | Epsilon: 0.161 | Time: 7.62s
Episode 365 | Total Reward: -4.60 | Avg(10): -171.59 | Epsilon: 0.160 | Time: 7.43s
Episode 366 | Total Reward: -122.25 | Avg(10): -171.40 | Epsilon: 0.160 | Time: 7.43s
Episode 367 | Total Reward: -128.82 | Avg(10): -160.15 | Epsilon: 0.159 | Time: 7.70s
Episode 368 | Total Reward: -337.81 | Avg(10): -181.50 | Epsilon: 0.158 | Time: 7.66s
Episode 369 | Total Reward: -6.04 | Avg(10): -181.86 | Epsilon: 0.157 | Time: 7.74s
Episode 370 | Total Reward: -248.24 | Avg(10): -182.85 | Epsilon: 0.157 | Time: 7.61s
Episode 371 | Total Reward: -2.60 | Avg(10): -158.73 | Epsilon: 0.156 | Time: 7.42s
Episode 372 | Total Reward: -1.45 | Avg(10): -133.32 | Epsilon: 0.155 | Time: 7.63s
Episode 373 | Total Reward: -123.15 | Avg(10): -120.99 | Epsilon: 0.154 | Time: 7.60s
Episode 374 | Total Reward: -128.46 | Avg(10): -110.34 | Epsilon: 0.153 | Time: 7.83s
Episode 375 | Total Reward: -116.44 | Avg(10): -121.53 | Epsilon: 0.153 | Time: 7.54s
Episode 376 | Total Reward: -124.43 | Avg(10): -121.74 | Epsilon: 0.152 | Time: 7.36s
Episode 377 | Total Reward: -456.82 | Avg(10): -154.54 | Epsilon: 0.151 | Time: 7.39s
Episode 378 | Total Reward: -243.69 | Avg(10): -145.13 | Epsilon: 0.150 | Time: 7.48s
Episode 379 | Total Reward: -127.26 | Avg(10): -157.25 | Epsilon: 0.150 | Time: 7.48s
Episode 380 | Total Reward: -375.98 | Avg(10): -170.03 | Epsilon: 0.149 | Time: 7.49s
Episode 381 | Total Reward: -6.29 | Avg(10): -170.40 | Epsilon: 0.148 | Time: 7.35s
Episode 382 | Total Reward: -129.04 | Avg(10): -183.16 | Epsilon: 0.147 | Time: 7.25s
Episode 383 | Total Reward: -246.31 | Avg(10): -195.47 | Epsilon: 0.147 | Time: 7.31s
Episode 384 | Total Reward: -348.21 | Avg(10): -217.45 | Epsilon: 0.146 | Time: 7.32s
Episode 385 | Total Reward: -127.38 | Avg(10): -218.54 | Epsilon: 0.145 | Time: 5.33s
Episode 386 | Total Reward: -378.88 | Avg(10): -243.99 | Epsilon: 0.144 | Time: 7.33s
Episode 387 | Total Reward: -127.48 | Avg(10): -211.05 | Epsilon: 0.144 | Time: 7.31s
Episode 388 | Total Reward: -1.53 | Avg(10): -186.84 | Epsilon: 0.143 | Time: 7.48s
Episode 389 | Total Reward: -232.36 | Avg(10): -197.35 | Epsilon: 0.142 | Time: 7.46s
Episode 390 | Total Reward: -252.55 | Avg(10): -185.00 | Epsilon: 0.142 | Time: 10.53s
Episode 391 | Total Reward: -232.84 | Avg(10): -207.66 | Epsilon: 0.141 | Time: 8.27s
Episode 392 | Total Reward: -233.99 | Avg(10): -218.15 | Epsilon: 0.140 | Time: 7.87s
Episode 393 | Total Reward: -120.74 | Avg(10): -205.60 | Epsilon: 0.139 | Time: 10.41s
Episode 394 | Total Reward: -3.07 | Avg(10): -171.08 | Epsilon: 0.139 | Time: 7.95s
Episode 395 | Total Reward: -124.04 | Avg(10): -170.75 | Epsilon: 0.138 | Time: 7.71s
Episode 396 | Total Reward: -252.53 | Avg(10): -158.11 | Epsilon: 0.137 | Time: 7.60s
Episode 397 | Total Reward: -359.93 | Avg(10): -181.36 | Epsilon: 0.137 | Time: 9.99s
Episode 398 | Total Reward: -438.84 | Avg(10): -225.09 | Epsilon: 0.136 | Time: 9.65s
Episode 399 | Total Reward: -116.09 | Avg(10): -213.46 | Epsilon: 0.135 | Time: 10.56s

--- Episode 400: Action Usage Analysis ---
Action distribution: [0.0751  0.07635 0.0704  0.06595 0.0719  0.0858  0.09855 0.09115 0.1066
 0.10305 0.15515]
Entropy (diversity): 2.366
--------------------------------------------------
Episode 400 | Total Reward: -128.75 | Avg(10): -201.08 | Epsilon: 0.135 | Time: 7.47s

Evaluating trained model...
Test Episode 1: Total Reward = -249.92
Test Episode 2: Total Reward = -358.65
Test Episode 3: Total Reward = -1.94
Test Episode 4: Total Reward = -475.13
Test Episode 5: Total Reward = -124.90
Test Episode 6: Total Reward = -126.36
Test Episode 7: Total Reward = -537.57
Test Episode 8: Total Reward = -380.44
Test Episode 9: Total Reward = -235.95
Test Episode 10: Total Reward = -248.49

Average Reward over 10 episodes: -273.93 ± 157.84
Best average reward over 10 episodes: -110.34
Best model weights saved to: 11act_400ep_extended_weights.h5
Total training time: 3448.73s

11act_400ep_extended Results:
Training best avg: -110.34
Evaluation: -273.93 ± 157.84
Training time: 3448.7s
================================================================================
Running: 21act_200ep_baseline
================================================================================

Model Summary:

Model Summary:
Model: "dqn_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_48 (Dense)            multiple                  256       
                                                                 
 dense_49 (Dense)            multiple                  4160      
                                                                 
 dense_50 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Episode 1 | Total Reward: -1071.04 | Avg(10): -1071.04 | Epsilon: 0.995 | Time: 0.07s
Episode 2 | Total Reward: -1325.16 | Avg(10): -1198.10 | Epsilon: 0.990 | Time: 0.04s
Episode 3 | Total Reward: -1278.84 | Avg(10): -1225.01 | Epsilon: 0.985 | Time: 0.03s
Episode 4 | Total Reward: -895.43 | Avg(10): -1142.62 | Epsilon: 0.980 | Time: 0.02s
Episode 5 | Total Reward: -874.60 | Avg(10): -1089.01 | Epsilon: 0.975 | Time: 0.15s
Episode 6 | Total Reward: -1068.26 | Avg(10): -1085.55 | Epsilon: 0.970 | Time: 6.76s
Episode 7 | Total Reward: -1001.54 | Avg(10): -1073.55 | Epsilon: 0.966 | Time: 6.80s
Episode 8 | Total Reward: -1437.40 | Avg(10): -1119.03 | Epsilon: 0.961 | Time: 6.91s
Episode 9 | Total Reward: -872.05 | Avg(10): -1091.59 | Epsilon: 0.956 | Time: 7.05s
Episode 10 | Total Reward: -1737.84 | Avg(10): -1156.22 | Epsilon: 0.951 | Time: 7.10s
Episode 11 | Total Reward: -1059.69 | Avg(10): -1155.08 | Epsilon: 0.946 | Time: 6.97s
Episode 12 | Total Reward: -1467.52 | Avg(10): -1169.32 | Epsilon: 0.942 | Time: 10.76s
Episode 13 | Total Reward: -1667.45 | Avg(10): -1208.18 | Epsilon: 0.937 | Time: 12.78s
Episode 14 | Total Reward: -869.30 | Avg(10): -1205.56 | Epsilon: 0.932 | Time: 10.25s
Episode 15 | Total Reward: -1226.10 | Avg(10): -1240.72 | Epsilon: 0.928 | Time: 7.30s
Episode 16 | Total Reward: -1163.18 | Avg(10): -1250.21 | Epsilon: 0.923 | Time: 7.35s
Episode 17 | Total Reward: -1374.93 | Avg(10): -1287.55 | Epsilon: 0.918 | Time: 7.33s
Episode 18 | Total Reward: -1016.26 | Avg(10): -1245.43 | Epsilon: 0.914 | Time: 7.15s
Episode 19 | Total Reward: -979.07 | Avg(10): -1256.13 | Epsilon: 0.909 | Time: 7.17s
Episode 20 | Total Reward: -1313.83 | Avg(10): -1213.73 | Epsilon: 0.905 | Time: 7.23s
Episode 21 | Total Reward: -1130.63 | Avg(10): -1220.83 | Epsilon: 0.900 | Time: 7.28s
Episode 22 | Total Reward: -1729.69 | Avg(10): -1247.04 | Epsilon: 0.896 | Time: 7.69s
Episode 23 | Total Reward: -970.83 | Avg(10): -1177.38 | Epsilon: 0.891 | Time: 7.34s
Episode 24 | Total Reward: -1315.00 | Avg(10): -1221.95 | Epsilon: 0.887 | Time: 10.64s
Episode 25 | Total Reward: -1665.13 | Avg(10): -1265.85 | Epsilon: 0.882 | Time: 7.36s
Episode 26 | Total Reward: -1226.60 | Avg(10): -1272.20 | Epsilon: 0.878 | Time: 7.31s
Episode 27 | Total Reward: -913.55 | Avg(10): -1226.06 | Epsilon: 0.873 | Time: 7.36s
Episode 28 | Total Reward: -1270.97 | Avg(10): -1251.53 | Epsilon: 0.869 | Time: 7.50s
Episode 29 | Total Reward: -1096.43 | Avg(10): -1263.27 | Epsilon: 0.865 | Time: 7.12s
Episode 30 | Total Reward: -1572.82 | Avg(10): -1289.17 | Epsilon: 0.860 | Time: 7.12s
Episode 31 | Total Reward: -1565.28 | Avg(10): -1332.63 | Epsilon: 0.856 | Time: 6.93s
Episode 32 | Total Reward: -1226.74 | Avg(10): -1282.33 | Epsilon: 0.852 | Time: 6.91s
Episode 33 | Total Reward: -1613.82 | Avg(10): -1346.63 | Epsilon: 0.848 | Time: 6.92s
Episode 34 | Total Reward: -970.85 | Avg(10): -1312.22 | Epsilon: 0.843 | Time: 6.86s
Episode 35 | Total Reward: -1477.12 | Avg(10): -1293.42 | Epsilon: 0.839 | Time: 6.90s
Episode 36 | Total Reward: -1592.66 | Avg(10): -1330.02 | Epsilon: 0.835 | Time: 6.75s
Episode 37 | Total Reward: -1023.54 | Avg(10): -1341.02 | Epsilon: 0.831 | Time: 6.78s
Episode 38 | Total Reward: -1462.79 | Avg(10): -1360.21 | Epsilon: 0.827 | Time: 6.74s
Episode 39 | Total Reward: -1460.55 | Avg(10): -1396.62 | Epsilon: 0.822 | Time: 7.23s
Episode 40 | Total Reward: -1404.51 | Avg(10): -1379.79 | Epsilon: 0.818 | Time: 6.98s
Episode 41 | Total Reward: -1092.54 | Avg(10): -1332.51 | Epsilon: 0.814 | Time: 7.02s
Episode 42 | Total Reward: -1713.37 | Avg(10): -1381.18 | Epsilon: 0.810 | Time: 7.06s
Episode 43 | Total Reward: -992.46 | Avg(10): -1319.04 | Epsilon: 0.806 | Time: 6.98s
Episode 44 | Total Reward: -1355.60 | Avg(10): -1357.52 | Epsilon: 0.802 | Time: 8.40s
Episode 45 | Total Reward: -1562.11 | Avg(10): -1366.01 | Epsilon: 0.798 | Time: 7.27s
Episode 46 | Total Reward: -877.56 | Avg(10): -1294.50 | Epsilon: 0.794 | Time: 6.99s
Episode 47 | Total Reward: -871.01 | Avg(10): -1279.25 | Epsilon: 0.790 | Time: 6.93s
Episode 48 | Total Reward: -1125.31 | Avg(10): -1245.50 | Epsilon: 0.786 | Time: 6.82s
Episode 49 | Total Reward: -1291.33 | Avg(10): -1228.58 | Epsilon: 0.782 | Time: 7.02s
Episode 50 | Total Reward: -1298.28 | Avg(10): -1217.96 | Epsilon: 0.778 | Time: 6.96s
Episode 51 | Total Reward: -1732.67 | Avg(10): -1281.97 | Epsilon: 0.774 | Time: 6.89s
Episode 52 | Total Reward: -1498.99 | Avg(10): -1260.53 | Epsilon: 0.771 | Time: 6.82s
Episode 53 | Total Reward: -995.77 | Avg(10): -1260.86 | Epsilon: 0.767 | Time: 6.79s
Episode 54 | Total Reward: -988.92 | Avg(10): -1224.20 | Epsilon: 0.763 | Time: 6.92s
Episode 55 | Total Reward: -977.34 | Avg(10): -1165.72 | Epsilon: 0.759 | Time: 6.88s
Episode 56 | Total Reward: -668.59 | Avg(10): -1144.82 | Epsilon: 0.755 | Time: 9.57s
Episode 57 | Total Reward: -936.87 | Avg(10): -1151.41 | Epsilon: 0.751 | Time: 11.82s
Episode 58 | Total Reward: -855.33 | Avg(10): -1124.41 | Epsilon: 0.748 | Time: 11.92s
Episode 59 | Total Reward: -884.78 | Avg(10): -1083.75 | Epsilon: 0.744 | Time: 9.69s
Episode 60 | Total Reward: -780.64 | Avg(10): -1031.99 | Epsilon: 0.740 | Time: 7.32s
Episode 61 | Total Reward: -870.74 | Avg(10): -945.80 | Epsilon: 0.737 | Time: 7.14s
Episode 62 | Total Reward: -1109.71 | Avg(10): -906.87 | Epsilon: 0.733 | Time: 8.95s
Episode 63 | Total Reward: -1390.29 | Avg(10): -946.32 | Epsilon: 0.729 | Time: 11.09s
Episode 64 | Total Reward: -860.93 | Avg(10): -933.52 | Epsilon: 0.726 | Time: 8.24s
Episode 65 | Total Reward: -1228.78 | Avg(10): -958.67 | Epsilon: 0.722 | Time: 7.07s
Episode 66 | Total Reward: -1492.96 | Avg(10): -1041.11 | Epsilon: 0.718 | Time: 8.69s
Episode 67 | Total Reward: -1267.81 | Avg(10): -1074.20 | Epsilon: 0.715 | Time: 5.65s
Episode 68 | Total Reward: -909.26 | Avg(10): -1079.59 | Epsilon: 0.711 | Time: 6.12s
Episode 69 | Total Reward: -1142.51 | Avg(10): -1105.36 | Epsilon: 0.708 | Time: 6.69s
Episode 70 | Total Reward: -1061.26 | Avg(10): -1133.43 | Epsilon: 0.704 | Time: 5.79s
Episode 71 | Total Reward: -1061.07 | Avg(10): -1152.46 | Epsilon: 0.701 | Time: 5.34s
Episode 72 | Total Reward: -1122.40 | Avg(10): -1153.73 | Epsilon: 0.697 | Time: 6.71s
Episode 73 | Total Reward: -861.02 | Avg(10): -1100.80 | Epsilon: 0.694 | Time: 5.79s
Episode 74 | Total Reward: -892.87 | Avg(10): -1103.99 | Epsilon: 0.690 | Time: 4.89s
Episode 75 | Total Reward: -1093.27 | Avg(10): -1090.44 | Epsilon: 0.687 | Time: 5.84s
Episode 76 | Total Reward: -1172.06 | Avg(10): -1058.35 | Epsilon: 0.683 | Time: 6.01s
Episode 77 | Total Reward: -1132.65 | Avg(10): -1044.84 | Epsilon: 0.680 | Time: 7.60s
Episode 78 | Total Reward: -874.03 | Avg(10): -1041.31 | Epsilon: 0.676 | Time: 10.94s
Episode 79 | Total Reward: -740.12 | Avg(10): -1001.07 | Epsilon: 0.673 | Time: 13.07s
Episode 80 | Total Reward: -1208.87 | Avg(10): -1015.84 | Epsilon: 0.670 | Time: 11.07s
Episode 81 | Total Reward: -1032.92 | Avg(10): -1013.02 | Epsilon: 0.666 | Time: 5.43s
Episode 82 | Total Reward: -958.17 | Avg(10): -996.60 | Epsilon: 0.663 | Time: 5.73s
Episode 83 | Total Reward: -1134.09 | Avg(10): -1023.91 | Epsilon: 0.660 | Time: 5.77s
Episode 84 | Total Reward: -1140.06 | Avg(10): -1048.62 | Epsilon: 0.656 | Time: 5.76s
Episode 85 | Total Reward: -1050.08 | Avg(10): -1044.31 | Epsilon: 0.653 | Time: 5.82s
Episode 86 | Total Reward: -1047.20 | Avg(10): -1031.82 | Epsilon: 0.650 | Time: 6.43s
Episode 87 | Total Reward: -1042.46 | Avg(10): -1022.80 | Epsilon: 0.647 | Time: 6.22s
Episode 88 | Total Reward: -1383.02 | Avg(10): -1073.70 | Epsilon: 0.643 | Time: 6.24s
Episode 89 | Total Reward: -1160.70 | Avg(10): -1115.76 | Epsilon: 0.640 | Time: 6.20s
Episode 90 | Total Reward: -1027.80 | Avg(10): -1097.65 | Epsilon: 0.637 | Time: 5.93s
Episode 91 | Total Reward: -1035.10 | Avg(10): -1097.87 | Epsilon: 0.634 | Time: 5.97s
Episode 92 | Total Reward: -903.18 | Avg(10): -1092.37 | Epsilon: 0.631 | Time: 6.10s
Episode 93 | Total Reward: -1214.28 | Avg(10): -1100.39 | Epsilon: 0.627 | Time: 5.90s
Episode 94 | Total Reward: -1042.68 | Avg(10): -1090.65 | Epsilon: 0.624 | Time: 6.01s
Episode 95 | Total Reward: -1110.91 | Avg(10): -1096.73 | Epsilon: 0.621 | Time: 6.35s
Episode 96 | Total Reward: -1058.53 | Avg(10): -1097.86 | Epsilon: 0.618 | Time: 6.45s
Episode 97 | Total Reward: -1175.52 | Avg(10): -1111.17 | Epsilon: 0.615 | Time: 6.07s
Episode 98 | Total Reward: -1124.26 | Avg(10): -1085.30 | Epsilon: 0.612 | Time: 6.16s
Episode 99 | Total Reward: -1004.23 | Avg(10): -1069.65 | Epsilon: 0.609 | Time: 5.99s

--- Episode 100: Action Usage Analysis ---
Action distribution: [0.0685  0.06105 0.05335 0.0444  0.04735 0.04505 0.0384  0.03875 0.0437
 0.0404  0.03685 0.04185 0.0395  0.0376  0.0459  0.047   0.04165 0.0449
 0.06035 0.0527  0.07075]
Entropy (diversity): 3.025
--------------------------------------------------
Episode 100 | Total Reward: -1163.63 | Avg(10): -1083.23 | Epsilon: 0.606 | Time: 6.34s
Episode 101 | Total Reward: -844.39 | Avg(10): -1064.16 | Epsilon: 0.603 | Time: 6.10s
Episode 102 | Total Reward: -1030.25 | Avg(10): -1076.87 | Epsilon: 0.600 | Time: 6.60s
Episode 103 | Total Reward: -940.68 | Avg(10): -1049.51 | Epsilon: 0.597 | Time: 6.94s
Episode 104 | Total Reward: -1181.74 | Avg(10): -1063.41 | Epsilon: 0.594 | Time: 7.62s
Episode 105 | Total Reward: -1066.85 | Avg(10): -1059.01 | Epsilon: 0.591 | Time: 7.83s
Episode 106 | Total Reward: -1049.41 | Avg(10): -1058.10 | Epsilon: 0.588 | Time: 6.03s
Episode 107 | Total Reward: -1036.52 | Avg(10): -1044.20 | Epsilon: 0.585 | Time: 5.98s
Episode 108 | Total Reward: -1116.21 | Avg(10): -1043.39 | Epsilon: 0.582 | Time: 6.63s
Episode 109 | Total Reward: -1069.13 | Avg(10): -1049.88 | Epsilon: 0.579 | Time: 7.06s
Episode 110 | Total Reward: -1032.71 | Avg(10): -1036.79 | Epsilon: 0.576 | Time: 6.59s
Episode 111 | Total Reward: -1074.12 | Avg(10): -1059.76 | Epsilon: 0.573 | Time: 7.42s
Episode 112 | Total Reward: -959.48 | Avg(10): -1052.68 | Epsilon: 0.570 | Time: 7.35s
Episode 113 | Total Reward: -1095.19 | Avg(10): -1068.14 | Epsilon: 0.568 | Time: 7.39s
Episode 114 | Total Reward: -1167.65 | Avg(10): -1066.73 | Epsilon: 0.565 | Time: 6.99s
Episode 115 | Total Reward: -1088.94 | Avg(10): -1068.94 | Epsilon: 0.562 | Time: 8.19s
Episode 116 | Total Reward: -957.73 | Avg(10): -1059.77 | Epsilon: 0.559 | Time: 6.87s
Episode 117 | Total Reward: -1007.47 | Avg(10): -1056.86 | Epsilon: 0.556 | Time: 7.27s
Episode 118 | Total Reward: -903.12 | Avg(10): -1035.55 | Epsilon: 0.554 | Time: 6.16s
Episode 119 | Total Reward: -971.75 | Avg(10): -1025.82 | Epsilon: 0.551 | Time: 5.89s
Episode 120 | Total Reward: -1109.98 | Avg(10): -1033.54 | Epsilon: 0.548 | Time: 5.67s
Episode 121 | Total Reward: -1000.50 | Avg(10): -1026.18 | Epsilon: 0.545 | Time: 5.95s
Episode 122 | Total Reward: -776.41 | Avg(10): -1007.88 | Epsilon: 0.543 | Time: 5.83s
Episode 123 | Total Reward: -1142.57 | Avg(10): -1012.61 | Epsilon: 0.540 | Time: 5.75s
Episode 124 | Total Reward: -1006.53 | Avg(10): -996.50 | Epsilon: 0.537 | Time: 5.96s
Episode 125 | Total Reward: -895.01 | Avg(10): -977.11 | Epsilon: 0.534 | Time: 7.14s
Episode 126 | Total Reward: -880.79 | Avg(10): -969.41 | Epsilon: 0.532 | Time: 6.91s
Episode 127 | Total Reward: -888.95 | Avg(10): -957.56 | Epsilon: 0.529 | Time: 6.14s
Episode 128 | Total Reward: -875.53 | Avg(10): -954.80 | Epsilon: 0.526 | Time: 6.55s
Episode 129 | Total Reward: -897.81 | Avg(10): -947.41 | Epsilon: 0.524 | Time: 6.64s
Episode 130 | Total Reward: -862.52 | Avg(10): -922.66 | Epsilon: 0.521 | Time: 6.47s
Episode 131 | Total Reward: -899.89 | Avg(10): -912.60 | Epsilon: 0.519 | Time: 6.18s
Episode 132 | Total Reward: -906.37 | Avg(10): -925.60 | Epsilon: 0.516 | Time: 6.80s
Episode 133 | Total Reward: -1038.25 | Avg(10): -915.16 | Epsilon: 0.513 | Time: 6.60s
Episode 134 | Total Reward: -875.00 | Avg(10): -902.01 | Epsilon: 0.511 | Time: 6.98s
Episode 135 | Total Reward: -748.85 | Avg(10): -887.39 | Epsilon: 0.508 | Time: 6.30s
Episode 136 | Total Reward: -1032.64 | Avg(10): -902.58 | Epsilon: 0.506 | Time: 6.32s
Episode 137 | Total Reward: -861.79 | Avg(10): -899.86 | Epsilon: 0.503 | Time: 6.20s
Episode 138 | Total Reward: -753.28 | Avg(10): -887.64 | Epsilon: 0.501 | Time: 6.64s
Episode 139 | Total Reward: -1036.17 | Avg(10): -901.47 | Epsilon: 0.498 | Time: 6.68s
Episode 140 | Total Reward: -738.49 | Avg(10): -889.07 | Epsilon: 0.496 | Time: 6.44s
Episode 141 | Total Reward: -868.89 | Avg(10): -885.97 | Epsilon: 0.493 | Time: 6.74s
Episode 142 | Total Reward: -758.17 | Avg(10): -871.15 | Epsilon: 0.491 | Time: 6.47s
Episode 143 | Total Reward: -876.97 | Avg(10): -855.02 | Epsilon: 0.488 | Time: 6.51s
Episode 144 | Total Reward: -623.30 | Avg(10): -829.85 | Epsilon: 0.486 | Time: 7.16s
Episode 145 | Total Reward: -619.18 | Avg(10): -816.89 | Epsilon: 0.483 | Time: 7.34s
Episode 146 | Total Reward: -726.57 | Avg(10): -786.28 | Epsilon: 0.481 | Time: 9.59s
Episode 147 | Total Reward: -604.40 | Avg(10): -760.54 | Epsilon: 0.479 | Time: 11.42s
Episode 148 | Total Reward: -1014.08 | Avg(10): -786.62 | Epsilon: 0.476 | Time: 9.77s
Episode 149 | Total Reward: -1040.82 | Avg(10): -787.09 | Epsilon: 0.474 | Time: 6.59s
Episode 150 | Total Reward: -869.49 | Avg(10): -800.19 | Epsilon: 0.471 | Time: 6.08s
Episode 151 | Total Reward: -253.09 | Avg(10): -738.61 | Epsilon: 0.469 | Time: 7.57s
Episode 152 | Total Reward: -736.14 | Avg(10): -736.40 | Epsilon: 0.467 | Time: 7.72s
Episode 153 | Total Reward: -557.31 | Avg(10): -704.44 | Epsilon: 0.464 | Time: 6.39s
Episode 154 | Total Reward: -510.87 | Avg(10): -693.19 | Epsilon: 0.462 | Time: 6.54s
Episode 155 | Total Reward: -405.81 | Avg(10): -671.86 | Epsilon: 0.460 | Time: 6.47s
Episode 156 | Total Reward: -712.04 | Avg(10): -670.40 | Epsilon: 0.458 | Time: 6.50s
Episode 157 | Total Reward: -934.82 | Avg(10): -703.45 | Epsilon: 0.455 | Time: 6.46s
Episode 158 | Total Reward: -383.14 | Avg(10): -640.35 | Epsilon: 0.453 | Time: 6.04s
Episode 159 | Total Reward: -382.50 | Avg(10): -574.52 | Epsilon: 0.451 | Time: 6.48s
Episode 160 | Total Reward: -263.28 | Avg(10): -513.90 | Epsilon: 0.448 | Time: 6.83s
Episode 161 | Total Reward: -715.92 | Avg(10): -560.18 | Epsilon: 0.446 | Time: 6.52s
Episode 162 | Total Reward: -382.70 | Avg(10): -524.84 | Epsilon: 0.444 | Time: 6.13s
Episode 163 | Total Reward: -510.60 | Avg(10): -520.17 | Epsilon: 0.442 | Time: 6.80s
Episode 164 | Total Reward: -379.12 | Avg(10): -506.99 | Epsilon: 0.440 | Time: 6.53s
Episode 165 | Total Reward: -359.72 | Avg(10): -502.38 | Epsilon: 0.437 | Time: 6.42s
Episode 166 | Total Reward: -502.46 | Avg(10): -481.42 | Epsilon: 0.435 | Time: 6.38s
Episode 167 | Total Reward: -373.64 | Avg(10): -425.31 | Epsilon: 0.433 | Time: 6.53s
Episode 168 | Total Reward: -366.49 | Avg(10): -423.64 | Epsilon: 0.431 | Time: 6.07s
Episode 169 | Total Reward: -522.96 | Avg(10): -437.69 | Epsilon: 0.429 | Time: 6.32s
Episode 170 | Total Reward: -637.65 | Avg(10): -475.13 | Epsilon: 0.427 | Time: 6.49s
Episode 171 | Total Reward: -582.47 | Avg(10): -461.78 | Epsilon: 0.424 | Time: 6.24s
Episode 172 | Total Reward: -705.43 | Avg(10): -494.05 | Epsilon: 0.422 | Time: 6.75s
Episode 173 | Total Reward: -1050.25 | Avg(10): -548.02 | Epsilon: 0.420 | Time: 6.03s
Episode 174 | Total Reward: -254.64 | Avg(10): -535.57 | Epsilon: 0.418 | Time: 5.43s
Episode 175 | Total Reward: -629.85 | Avg(10): -562.59 | Epsilon: 0.416 | Time: 4.90s
Episode 176 | Total Reward: -497.54 | Avg(10): -562.09 | Epsilon: 0.414 | Time: 7.02s
Episode 177 | Total Reward: -787.77 | Avg(10): -603.51 | Epsilon: 0.412 | Time: 6.60s
Episode 178 | Total Reward: -667.25 | Avg(10): -633.58 | Epsilon: 0.410 | Time: 5.93s
Episode 179 | Total Reward: -511.33 | Avg(10): -632.42 | Epsilon: 0.408 | Time: 6.02s
Episode 180 | Total Reward: -721.75 | Avg(10): -640.83 | Epsilon: 0.406 | Time: 6.50s
Episode 181 | Total Reward: -251.16 | Avg(10): -607.70 | Epsilon: 0.404 | Time: 5.93s
Episode 182 | Total Reward: -499.46 | Avg(10): -587.10 | Epsilon: 0.402 | Time: 6.09s
Episode 183 | Total Reward: -278.00 | Avg(10): -509.88 | Epsilon: 0.400 | Time: 6.35s
Episode 184 | Total Reward: -506.48 | Avg(10): -535.06 | Epsilon: 0.398 | Time: 6.59s
Episode 185 | Total Reward: -254.98 | Avg(10): -497.57 | Epsilon: 0.396 | Time: 7.64s
Episode 186 | Total Reward: -493.05 | Avg(10): -497.12 | Epsilon: 0.394 | Time: 6.28s
Episode 187 | Total Reward: -1459.52 | Avg(10): -564.30 | Epsilon: 0.392 | Time: 5.88s
Episode 188 | Total Reward: -439.39 | Avg(10): -541.51 | Epsilon: 0.390 | Time: 5.91s
Episode 189 | Total Reward: -920.38 | Avg(10): -582.42 | Epsilon: 0.388 | Time: 5.78s
Episode 190 | Total Reward: -361.98 | Avg(10): -546.44 | Epsilon: 0.386 | Time: 6.54s
Episode 191 | Total Reward: -751.59 | Avg(10): -596.48 | Epsilon: 0.384 | Time: 6.20s
Episode 192 | Total Reward: -453.68 | Avg(10): -591.91 | Epsilon: 0.382 | Time: 6.21s
Episode 193 | Total Reward: -373.89 | Avg(10): -601.50 | Epsilon: 0.380 | Time: 6.01s
Episode 194 | Total Reward: -503.69 | Avg(10): -601.22 | Epsilon: 0.378 | Time: 5.86s
Episode 195 | Total Reward: -866.74 | Avg(10): -662.39 | Epsilon: 0.376 | Time: 5.98s
Episode 196 | Total Reward: -374.46 | Avg(10): -650.53 | Epsilon: 0.374 | Time: 5.80s
Episode 197 | Total Reward: -370.20 | Avg(10): -541.60 | Epsilon: 0.373 | Time: 6.15s
Episode 198 | Total Reward: -374.90 | Avg(10): -535.15 | Epsilon: 0.371 | Time: 6.21s
Episode 199 | Total Reward: -631.44 | Avg(10): -506.26 | Epsilon: 0.369 | Time: 5.88s

--- Episode 200: Action Usage Analysis ---
Action distribution: [0.11565 0.04935 0.04615 0.0464  0.03505 0.03445 0.0286  0.0354  0.02745
 0.0395  0.03775 0.0542  0.0316  0.026   0.0328  0.03435 0.02855 0.03615
 0.14605 0.0462  0.06835]
Entropy (diversity): 2.904
--------------------------------------------------
Episode 200 | Total Reward: -722.58 | Avg(10): -542.32 | Epsilon: 0.367 | Time: 6.01s

Evaluating trained model...
Test Episode 1: Total Reward = -700.64
Test Episode 2: Total Reward = -902.23
Test Episode 3: Total Reward = -510.18
Test Episode 4: Total Reward = -435.39
Test Episode 5: Total Reward = -636.19
Test Episode 6: Total Reward = -650.03
Test Episode 7: Total Reward = -773.96
Test Episode 8: Total Reward = -761.97
Test Episode 9: Total Reward = -901.67
Test Episode 10: Total Reward = -519.21

Average Reward over 10 episodes: -679.15 ± 152.26
Best average reward over 10 episodes: -423.64
Best model weights saved to: 21act_200ep_baseline_weights.h5
Total training time: 1354.32s

21act_200ep_baseline Results:
Training best avg: -423.64
Evaluation: -679.15 ± 152.26
Training time: 1354.3s
================================================================================
Running: 21act_600ep_extended
================================================================================

Model Summary:

Model Summary:
Model: "dqn_18"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_54 (Dense)            multiple                  256       
                                                                 
 dense_55 (Dense)            multiple                  4160      
                                                                 
 dense_56 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Episode 1 | Total Reward: -1260.21 | Avg(10): -1260.21 | Epsilon: 0.995 | Time: 0.02s
Episode 2 | Total Reward: -853.05 | Avg(10): -1056.63 | Epsilon: 0.990 | Time: 0.03s
Episode 3 | Total Reward: -1099.96 | Avg(10): -1071.07 | Epsilon: 0.985 | Time: 0.02s
Episode 4 | Total Reward: -1057.87 | Avg(10): -1067.77 | Epsilon: 0.980 | Time: 0.03s
Episode 5 | Total Reward: -1485.19 | Avg(10): -1151.25 | Epsilon: 0.975 | Time: 0.11s
Episode 6 | Total Reward: -1189.53 | Avg(10): -1157.63 | Epsilon: 0.970 | Time: 5.77s
Episode 7 | Total Reward: -756.29 | Avg(10): -1100.30 | Epsilon: 0.966 | Time: 5.43s
Episode 8 | Total Reward: -996.25 | Avg(10): -1087.29 | Epsilon: 0.961 | Time: 5.91s
Episode 9 | Total Reward: -1380.12 | Avg(10): -1119.83 | Epsilon: 0.956 | Time: 5.96s
Episode 10 | Total Reward: -1192.09 | Avg(10): -1127.06 | Epsilon: 0.951 | Time: 5.53s
Episode 11 | Total Reward: -1573.70 | Avg(10): -1158.40 | Epsilon: 0.946 | Time: 5.77s
Episode 12 | Total Reward: -861.10 | Avg(10): -1159.21 | Epsilon: 0.942 | Time: 5.64s
Episode 13 | Total Reward: -955.30 | Avg(10): -1144.74 | Epsilon: 0.937 | Time: 5.95s
Episode 14 | Total Reward: -983.57 | Avg(10): -1137.31 | Epsilon: 0.932 | Time: 5.43s
Episode 15 | Total Reward: -1741.77 | Avg(10): -1162.97 | Epsilon: 0.928 | Time: 5.58s
Episode 16 | Total Reward: -1071.64 | Avg(10): -1151.18 | Epsilon: 0.923 | Time: 5.62s
Episode 17 | Total Reward: -1076.70 | Avg(10): -1183.22 | Epsilon: 0.918 | Time: 5.81s
Episode 18 | Total Reward: -871.77 | Avg(10): -1170.78 | Epsilon: 0.914 | Time: 5.75s
Episode 19 | Total Reward: -1645.05 | Avg(10): -1197.27 | Epsilon: 0.909 | Time: 5.80s
Episode 20 | Total Reward: -1059.14 | Avg(10): -1183.97 | Epsilon: 0.905 | Time: 5.77s
Episode 21 | Total Reward: -1304.44 | Avg(10): -1157.05 | Epsilon: 0.900 | Time: 5.71s
Episode 22 | Total Reward: -1000.41 | Avg(10): -1170.98 | Epsilon: 0.896 | Time: 5.49s
Episode 23 | Total Reward: -1318.98 | Avg(10): -1207.35 | Epsilon: 0.891 | Time: 5.74s
Episode 24 | Total Reward: -1193.88 | Avg(10): -1228.38 | Epsilon: 0.887 | Time: 5.91s
Episode 25 | Total Reward: -1604.38 | Avg(10): -1214.64 | Epsilon: 0.882 | Time: 5.82s
Episode 26 | Total Reward: -1127.63 | Avg(10): -1220.24 | Epsilon: 0.878 | Time: 6.34s
Episode 27 | Total Reward: -1301.74 | Avg(10): -1242.74 | Epsilon: 0.873 | Time: 5.94s
Episode 28 | Total Reward: -1096.30 | Avg(10): -1265.20 | Epsilon: 0.869 | Time: 6.10s
Episode 29 | Total Reward: -1278.39 | Avg(10): -1228.53 | Epsilon: 0.865 | Time: 6.08s
Episode 30 | Total Reward: -1440.13 | Avg(10): -1266.63 | Epsilon: 0.860 | Time: 5.66s
Episode 31 | Total Reward: -1301.26 | Avg(10): -1266.31 | Epsilon: 0.856 | Time: 5.43s
Episode 32 | Total Reward: -960.20 | Avg(10): -1262.29 | Epsilon: 0.852 | Time: 4.95s
Episode 33 | Total Reward: -1159.26 | Avg(10): -1246.32 | Epsilon: 0.848 | Time: 4.97s
Episode 34 | Total Reward: -755.83 | Avg(10): -1202.51 | Epsilon: 0.843 | Time: 4.85s
Episode 35 | Total Reward: -1675.50 | Avg(10): -1209.63 | Epsilon: 0.839 | Time: 5.19s
Episode 36 | Total Reward: -866.48 | Avg(10): -1183.51 | Epsilon: 0.835 | Time: 6.65s
Episode 37 | Total Reward: -1770.91 | Avg(10): -1230.43 | Epsilon: 0.831 | Time: 6.78s
Episode 38 | Total Reward: -1441.03 | Avg(10): -1264.90 | Epsilon: 0.827 | Time: 6.06s
Episode 39 | Total Reward: -1212.46 | Avg(10): -1258.31 | Epsilon: 0.822 | Time: 6.71s
Episode 40 | Total Reward: -1092.69 | Avg(10): -1223.56 | Epsilon: 0.818 | Time: 6.60s
Episode 41 | Total Reward: -974.74 | Avg(10): -1190.91 | Epsilon: 0.814 | Time: 6.39s
Episode 42 | Total Reward: -902.98 | Avg(10): -1185.19 | Epsilon: 0.810 | Time: 5.97s
Episode 43 | Total Reward: -1619.62 | Avg(10): -1231.23 | Epsilon: 0.806 | Time: 5.94s
Episode 44 | Total Reward: -1079.33 | Avg(10): -1263.58 | Epsilon: 0.802 | Time: 5.91s
Episode 45 | Total Reward: -1314.63 | Avg(10): -1227.49 | Epsilon: 0.798 | Time: 6.56s
Episode 46 | Total Reward: -1648.69 | Avg(10): -1305.71 | Epsilon: 0.794 | Time: 5.96s
Episode 47 | Total Reward: -1298.06 | Avg(10): -1258.42 | Epsilon: 0.790 | Time: 5.68s
Episode 48 | Total Reward: -1077.45 | Avg(10): -1222.07 | Epsilon: 0.786 | Time: 6.02s
Episode 49 | Total Reward: -1811.76 | Avg(10): -1282.00 | Epsilon: 0.782 | Time: 5.75s
Episode 50 | Total Reward: -1005.36 | Avg(10): -1273.26 | Epsilon: 0.778 | Time: 5.68s
Episode 51 | Total Reward: -1083.98 | Avg(10): -1284.19 | Epsilon: 0.774 | Time: 5.60s
Episode 52 | Total Reward: -906.03 | Avg(10): -1284.49 | Epsilon: 0.771 | Time: 5.55s
Episode 53 | Total Reward: -1206.68 | Avg(10): -1243.20 | Epsilon: 0.767 | Time: 5.50s
Episode 54 | Total Reward: -1012.59 | Avg(10): -1236.52 | Epsilon: 0.763 | Time: 5.42s
Episode 55 | Total Reward: -908.45 | Avg(10): -1195.90 | Epsilon: 0.759 | Time: 5.52s
Episode 56 | Total Reward: -1200.20 | Avg(10): -1151.06 | Epsilon: 0.755 | Time: 5.50s
Episode 57 | Total Reward: -1264.69 | Avg(10): -1147.72 | Epsilon: 0.751 | Time: 5.77s
Episode 58 | Total Reward: -1138.07 | Avg(10): -1153.78 | Epsilon: 0.748 | Time: 5.44s
Episode 59 | Total Reward: -864.91 | Avg(10): -1059.09 | Epsilon: 0.744 | Time: 5.43s
Episode 60 | Total Reward: -1577.72 | Avg(10): -1116.33 | Epsilon: 0.740 | Time: 5.21s
Episode 61 | Total Reward: -905.24 | Avg(10): -1098.46 | Epsilon: 0.737 | Time: 5.62s
Episode 62 | Total Reward: -783.01 | Avg(10): -1086.16 | Epsilon: 0.733 | Time: 5.29s
Episode 63 | Total Reward: -792.55 | Avg(10): -1044.74 | Epsilon: 0.729 | Time: 5.55s
Episode 64 | Total Reward: -931.95 | Avg(10): -1036.68 | Epsilon: 0.726 | Time: 6.98s
Episode 65 | Total Reward: -1276.96 | Avg(10): -1073.53 | Epsilon: 0.722 | Time: 6.08s
Episode 66 | Total Reward: -1033.37 | Avg(10): -1056.85 | Epsilon: 0.718 | Time: 6.39s
Episode 67 | Total Reward: -1051.38 | Avg(10): -1035.52 | Epsilon: 0.715 | Time: 5.86s
Episode 68 | Total Reward: -973.91 | Avg(10): -1019.10 | Epsilon: 0.711 | Time: 5.35s
Episode 69 | Total Reward: -1028.06 | Avg(10): -1035.42 | Epsilon: 0.708 | Time: 5.68s
Episode 70 | Total Reward: -1328.87 | Avg(10): -1010.53 | Epsilon: 0.704 | Time: 5.39s
Episode 71 | Total Reward: -1430.29 | Avg(10): -1063.04 | Epsilon: 0.701 | Time: 5.77s
Episode 72 | Total Reward: -1351.56 | Avg(10): -1119.89 | Epsilon: 0.697 | Time: 6.57s
Episode 73 | Total Reward: -1012.96 | Avg(10): -1141.93 | Epsilon: 0.694 | Time: 5.70s
Episode 74 | Total Reward: -978.68 | Avg(10): -1146.60 | Epsilon: 0.690 | Time: 5.24s
Episode 75 | Total Reward: -1114.90 | Avg(10): -1130.40 | Epsilon: 0.687 | Time: 5.46s
Episode 76 | Total Reward: -887.93 | Avg(10): -1115.85 | Epsilon: 0.683 | Time: 5.57s
Episode 77 | Total Reward: -1009.32 | Avg(10): -1111.65 | Epsilon: 0.680 | Time: 6.16s
Episode 78 | Total Reward: -1195.85 | Avg(10): -1133.84 | Epsilon: 0.676 | Time: 5.70s
Episode 79 | Total Reward: -986.25 | Avg(10): -1129.66 | Epsilon: 0.673 | Time: 5.56s
Episode 80 | Total Reward: -1036.30 | Avg(10): -1100.40 | Epsilon: 0.670 | Time: 6.05s
Episode 81 | Total Reward: -1133.88 | Avg(10): -1070.76 | Epsilon: 0.666 | Time: 5.61s
Episode 82 | Total Reward: -977.99 | Avg(10): -1033.41 | Epsilon: 0.663 | Time: 5.91s
Episode 83 | Total Reward: -1064.45 | Avg(10): -1038.55 | Epsilon: 0.660 | Time: 5.71s
Episode 84 | Total Reward: -1161.56 | Avg(10): -1056.84 | Epsilon: 0.656 | Time: 6.17s
Episode 85 | Total Reward: -1254.52 | Avg(10): -1070.80 | Epsilon: 0.653 | Time: 6.26s
Episode 86 | Total Reward: -889.53 | Avg(10): -1070.96 | Epsilon: 0.650 | Time: 5.87s
Episode 87 | Total Reward: -1234.20 | Avg(10): -1093.45 | Epsilon: 0.647 | Time: 5.51s
Episode 88 | Total Reward: -917.96 | Avg(10): -1065.66 | Epsilon: 0.643 | Time: 5.72s
Episode 89 | Total Reward: -1042.33 | Avg(10): -1071.27 | Epsilon: 0.640 | Time: 5.93s
Episode 90 | Total Reward: -1173.32 | Avg(10): -1084.97 | Epsilon: 0.637 | Time: 5.76s
Episode 91 | Total Reward: -693.76 | Avg(10): -1040.96 | Epsilon: 0.634 | Time: 5.91s
Episode 92 | Total Reward: -969.51 | Avg(10): -1040.11 | Epsilon: 0.631 | Time: 6.08s
Episode 93 | Total Reward: -1311.49 | Avg(10): -1064.82 | Epsilon: 0.627 | Time: 5.79s
Episode 94 | Total Reward: -1092.32 | Avg(10): -1057.89 | Epsilon: 0.624 | Time: 5.74s
Episode 95 | Total Reward: -882.39 | Avg(10): -1020.68 | Epsilon: 0.621 | Time: 5.91s
Episode 96 | Total Reward: -1179.28 | Avg(10): -1049.65 | Epsilon: 0.618 | Time: 5.98s
Episode 97 | Total Reward: -1085.11 | Avg(10): -1034.75 | Epsilon: 0.615 | Time: 6.00s
Episode 98 | Total Reward: -1048.00 | Avg(10): -1047.75 | Epsilon: 0.612 | Time: 5.96s
Episode 99 | Total Reward: -724.55 | Avg(10): -1015.97 | Epsilon: 0.609 | Time: 5.97s

--- Episode 100: Action Usage Analysis ---
Action distribution: [0.07185 0.046   0.04695 0.04285 0.04695 0.05485 0.0371  0.0429  0.0436
 0.04185 0.04025 0.0483  0.03955 0.0374  0.04315 0.0501  0.04    0.04505
 0.05665 0.06305 0.0616 ]
Entropy (diversity): 3.028
--------------------------------------------------
Episode 100 | Total Reward: -1020.09 | Avg(10): -1000.65 | Epsilon: 0.606 | Time: 6.05s
Episode 101 | Total Reward: -1109.17 | Avg(10): -1042.19 | Epsilon: 0.603 | Time: 5.96s
Episode 102 | Total Reward: -1239.24 | Avg(10): -1069.16 | Epsilon: 0.600 | Time: 6.09s
Episode 103 | Total Reward: -924.93 | Avg(10): -1030.51 | Epsilon: 0.597 | Time: 6.11s
Episode 104 | Total Reward: -753.51 | Avg(10): -996.63 | Epsilon: 0.594 | Time: 6.66s
Episode 105 | Total Reward: -1192.71 | Avg(10): -1027.66 | Epsilon: 0.591 | Time: 6.42s
Episode 106 | Total Reward: -1017.93 | Avg(10): -1011.52 | Epsilon: 0.588 | Time: 6.52s
Episode 107 | Total Reward: -1290.64 | Avg(10): -1032.08 | Epsilon: 0.585 | Time: 7.55s
Episode 108 | Total Reward: -1130.91 | Avg(10): -1040.37 | Epsilon: 0.582 | Time: 5.92s
Episode 109 | Total Reward: -1045.01 | Avg(10): -1072.41 | Epsilon: 0.579 | Time: 5.57s
Episode 110 | Total Reward: -907.94 | Avg(10): -1061.20 | Epsilon: 0.576 | Time: 5.91s
Episode 111 | Total Reward: -1029.73 | Avg(10): -1053.25 | Epsilon: 0.573 | Time: 5.66s
Episode 112 | Total Reward: -1004.60 | Avg(10): -1029.79 | Epsilon: 0.570 | Time: 6.42s
Episode 113 | Total Reward: -958.91 | Avg(10): -1033.19 | Epsilon: 0.568 | Time: 5.42s
Episode 114 | Total Reward: -1249.90 | Avg(10): -1082.83 | Epsilon: 0.565 | Time: 5.56s
Episode 115 | Total Reward: -1088.13 | Avg(10): -1072.37 | Epsilon: 0.562 | Time: 5.42s
Episode 116 | Total Reward: -919.75 | Avg(10): -1062.55 | Epsilon: 0.559 | Time: 6.03s
Episode 117 | Total Reward: -889.75 | Avg(10): -1022.46 | Epsilon: 0.556 | Time: 6.38s
Episode 118 | Total Reward: -998.87 | Avg(10): -1009.26 | Epsilon: 0.554 | Time: 5.46s
Episode 119 | Total Reward: -1165.03 | Avg(10): -1021.26 | Epsilon: 0.551 | Time: 6.18s
Episode 120 | Total Reward: -1068.79 | Avg(10): -1037.35 | Epsilon: 0.548 | Time: 5.89s
Episode 121 | Total Reward: -1182.64 | Avg(10): -1052.64 | Epsilon: 0.545 | Time: 5.85s
Episode 122 | Total Reward: -1061.75 | Avg(10): -1058.35 | Epsilon: 0.543 | Time: 5.82s
Episode 123 | Total Reward: -873.84 | Avg(10): -1049.84 | Epsilon: 0.540 | Time: 5.79s
Episode 124 | Total Reward: -1153.31 | Avg(10): -1040.19 | Epsilon: 0.537 | Time: 5.96s
Episode 125 | Total Reward: -1021.27 | Avg(10): -1033.50 | Epsilon: 0.534 | Time: 5.87s
Episode 126 | Total Reward: -775.44 | Avg(10): -1019.07 | Epsilon: 0.532 | Time: 6.02s
Episode 127 | Total Reward: -1169.42 | Avg(10): -1047.04 | Epsilon: 0.529 | Time: 5.98s
Episode 128 | Total Reward: -972.88 | Avg(10): -1044.44 | Epsilon: 0.526 | Time: 5.95s
Episode 129 | Total Reward: -1115.44 | Avg(10): -1039.48 | Epsilon: 0.524 | Time: 5.87s
Episode 130 | Total Reward: -903.40 | Avg(10): -1022.94 | Epsilon: 0.521 | Time: 5.83s
Episode 131 | Total Reward: -787.64 | Avg(10): -983.44 | Epsilon: 0.519 | Time: 5.95s
Episode 132 | Total Reward: -1157.69 | Avg(10): -993.03 | Epsilon: 0.516 | Time: 5.93s
Episode 133 | Total Reward: -1119.54 | Avg(10): -1017.60 | Epsilon: 0.513 | Time: 5.98s
Episode 134 | Total Reward: -915.32 | Avg(10): -993.80 | Epsilon: 0.511 | Time: 6.08s
Episode 135 | Total Reward: -533.06 | Avg(10): -944.98 | Epsilon: 0.508 | Time: 5.54s
Episode 136 | Total Reward: -1228.86 | Avg(10): -990.32 | Epsilon: 0.506 | Time: 6.13s
Episode 137 | Total Reward: -899.04 | Avg(10): -963.29 | Epsilon: 0.503 | Time: 5.67s
Episode 138 | Total Reward: -1190.39 | Avg(10): -985.04 | Epsilon: 0.501 | Time: 5.73s
Episode 139 | Total Reward: -521.29 | Avg(10): -925.62 | Epsilon: 0.498 | Time: 5.76s
Episode 140 | Total Reward: -871.51 | Avg(10): -922.43 | Epsilon: 0.496 | Time: 5.93s
Episode 141 | Total Reward: -879.51 | Avg(10): -931.62 | Epsilon: 0.493 | Time: 5.87s
Episode 142 | Total Reward: -982.96 | Avg(10): -914.15 | Epsilon: 0.491 | Time: 7.21s
Episode 143 | Total Reward: -728.04 | Avg(10): -875.00 | Epsilon: 0.488 | Time: 6.27s
Episode 144 | Total Reward: -880.95 | Avg(10): -871.56 | Epsilon: 0.486 | Time: 6.11s
Episode 145 | Total Reward: -763.74 | Avg(10): -894.63 | Epsilon: 0.483 | Time: 6.31s
Episode 146 | Total Reward: -903.06 | Avg(10): -862.05 | Epsilon: 0.481 | Time: 6.11s
Episode 147 | Total Reward: -513.93 | Avg(10): -823.54 | Epsilon: 0.479 | Time: 6.68s
Episode 148 | Total Reward: -458.15 | Avg(10): -750.31 | Epsilon: 0.476 | Time: 5.91s
Episode 149 | Total Reward: -616.33 | Avg(10): -759.82 | Epsilon: 0.474 | Time: 5.75s
Episode 150 | Total Reward: -772.64 | Avg(10): -749.93 | Epsilon: 0.471 | Time: 7.18s
Episode 151 | Total Reward: -942.90 | Avg(10): -756.27 | Epsilon: 0.469 | Time: 6.27s
Episode 152 | Total Reward: -845.58 | Avg(10): -742.53 | Epsilon: 0.467 | Time: 6.46s
Episode 153 | Total Reward: -625.96 | Avg(10): -732.32 | Epsilon: 0.464 | Time: 5.94s
Episode 154 | Total Reward: -735.70 | Avg(10): -717.80 | Epsilon: 0.462 | Time: 6.44s
Episode 155 | Total Reward: -626.61 | Avg(10): -704.09 | Epsilon: 0.460 | Time: 5.82s
Episode 156 | Total Reward: -864.68 | Avg(10): -700.25 | Epsilon: 0.458 | Time: 5.59s
Episode 157 | Total Reward: -725.36 | Avg(10): -721.39 | Epsilon: 0.455 | Time: 6.49s
Episode 158 | Total Reward: -866.71 | Avg(10): -762.25 | Epsilon: 0.453 | Time: 5.80s
Episode 159 | Total Reward: -615.76 | Avg(10): -762.19 | Epsilon: 0.451 | Time: 5.87s
Episode 160 | Total Reward: -605.21 | Avg(10): -745.45 | Epsilon: 0.448 | Time: 6.16s
Episode 161 | Total Reward: -955.43 | Avg(10): -746.70 | Epsilon: 0.446 | Time: 6.12s
Episode 162 | Total Reward: -621.55 | Avg(10): -724.30 | Epsilon: 0.444 | Time: 6.00s
Episode 163 | Total Reward: -658.26 | Avg(10): -727.53 | Epsilon: 0.442 | Time: 6.18s
Episode 164 | Total Reward: -768.88 | Avg(10): -730.84 | Epsilon: 0.440 | Time: 5.99s
Episode 165 | Total Reward: -541.66 | Avg(10): -722.35 | Epsilon: 0.437 | Time: 6.13s
Episode 166 | Total Reward: -970.96 | Avg(10): -732.98 | Epsilon: 0.435 | Time: 5.79s
Episode 167 | Total Reward: -748.41 | Avg(10): -735.28 | Epsilon: 0.433 | Time: 6.30s
Episode 168 | Total Reward: -991.35 | Avg(10): -747.75 | Epsilon: 0.431 | Time: 5.98s
Episode 169 | Total Reward: -1016.90 | Avg(10): -787.86 | Epsilon: 0.429 | Time: 6.08s
Episode 170 | Total Reward: -506.62 | Avg(10): -778.00 | Epsilon: 0.427 | Time: 6.16s
Episode 171 | Total Reward: -975.70 | Avg(10): -780.03 | Epsilon: 0.424 | Time: 6.01s
Episode 172 | Total Reward: -860.11 | Avg(10): -803.88 | Epsilon: 0.422 | Time: 6.23s
Episode 173 | Total Reward: -1049.59 | Avg(10): -843.02 | Epsilon: 0.420 | Time: 5.99s
Episode 174 | Total Reward: -635.67 | Avg(10): -829.70 | Epsilon: 0.418 | Time: 5.77s
Episode 175 | Total Reward: -1006.57 | Avg(10): -876.19 | Epsilon: 0.416 | Time: 5.87s
Episode 176 | Total Reward: -627.20 | Avg(10): -841.81 | Epsilon: 0.414 | Time: 5.77s
Episode 177 | Total Reward: -319.50 | Avg(10): -798.92 | Epsilon: 0.412 | Time: 5.77s
Episode 178 | Total Reward: -506.71 | Avg(10): -750.46 | Epsilon: 0.410 | Time: 5.89s
Episode 179 | Total Reward: -570.86 | Avg(10): -705.85 | Epsilon: 0.408 | Time: 6.18s
Episode 180 | Total Reward: -1073.87 | Avg(10): -762.58 | Epsilon: 0.406 | Time: 6.04s
Episode 181 | Total Reward: -229.95 | Avg(10): -688.00 | Epsilon: 0.404 | Time: 6.33s
Episode 182 | Total Reward: -373.71 | Avg(10): -639.36 | Epsilon: 0.402 | Time: 7.05s
Episode 183 | Total Reward: -254.75 | Avg(10): -559.88 | Epsilon: 0.400 | Time: 6.59s
Episode 184 | Total Reward: -506.25 | Avg(10): -546.94 | Epsilon: 0.398 | Time: 6.21s
Episode 185 | Total Reward: -487.93 | Avg(10): -495.07 | Epsilon: 0.396 | Time: 6.67s
Episode 186 | Total Reward: -470.49 | Avg(10): -479.40 | Epsilon: 0.394 | Time: 7.28s
Episode 187 | Total Reward: -606.65 | Avg(10): -508.12 | Epsilon: 0.392 | Time: 6.15s
Episode 188 | Total Reward: -261.75 | Avg(10): -483.62 | Epsilon: 0.390 | Time: 6.03s
Episode 189 | Total Reward: -487.98 | Avg(10): -475.33 | Epsilon: 0.388 | Time: 6.70s
Episode 190 | Total Reward: -763.40 | Avg(10): -444.29 | Epsilon: 0.386 | Time: 5.99s
Episode 191 | Total Reward: -253.57 | Avg(10): -446.65 | Epsilon: 0.384 | Time: 6.64s
Episode 192 | Total Reward: -494.78 | Avg(10): -458.75 | Epsilon: 0.382 | Time: 5.84s
Episode 193 | Total Reward: -254.10 | Avg(10): -458.69 | Epsilon: 0.380 | Time: 5.39s
Episode 194 | Total Reward: -250.38 | Avg(10): -433.10 | Epsilon: 0.378 | Time: 5.46s
Episode 195 | Total Reward: -628.30 | Avg(10): -447.14 | Epsilon: 0.376 | Time: 5.79s
Episode 196 | Total Reward: -510.82 | Avg(10): -451.17 | Epsilon: 0.374 | Time: 6.30s
Episode 197 | Total Reward: -630.11 | Avg(10): -453.52 | Epsilon: 0.373 | Time: 5.74s
Episode 198 | Total Reward: -621.38 | Avg(10): -489.48 | Epsilon: 0.371 | Time: 5.90s
Episode 199 | Total Reward: -627.87 | Avg(10): -503.47 | Epsilon: 0.369 | Time: 6.03s

--- Episode 200: Action Usage Analysis ---
Action distribution: [0.0859  0.0548  0.0468  0.12925 0.03375 0.0385  0.025   0.0318  0.02475
 0.0441  0.0296  0.03795 0.03315 0.03305 0.0329  0.03715 0.03135 0.04255
 0.0485  0.0496  0.10955]
Entropy (diversity): 2.918
--------------------------------------------------
Episode 200 | Total Reward: -379.31 | Avg(10): -465.06 | Epsilon: 0.367 | Time: 6.52s
Episode 201 | Total Reward: -750.13 | Avg(10): -514.72 | Epsilon: 0.365 | Time: 5.68s
Episode 202 | Total Reward: -694.49 | Avg(10): -534.69 | Epsilon: 0.363 | Time: 4.87s
Episode 203 | Total Reward: -380.11 | Avg(10): -547.29 | Epsilon: 0.361 | Time: 4.79s
Episode 204 | Total Reward: -457.97 | Avg(10): -568.05 | Epsilon: 0.360 | Time: 4.67s
Episode 205 | Total Reward: -506.86 | Avg(10): -555.91 | Epsilon: 0.358 | Time: 4.79s
Episode 206 | Total Reward: -386.97 | Avg(10): -543.52 | Epsilon: 0.356 | Time: 4.68s
Episode 207 | Total Reward: -1001.89 | Avg(10): -580.70 | Epsilon: 0.354 | Time: 4.90s
Episode 208 | Total Reward: -1053.61 | Avg(10): -623.92 | Epsilon: 0.353 | Time: 4.74s
Episode 209 | Total Reward: -905.91 | Avg(10): -651.72 | Epsilon: 0.351 | Time: 4.66s
Episode 210 | Total Reward: -818.04 | Avg(10): -695.60 | Epsilon: 0.349 | Time: 4.73s
Episode 211 | Total Reward: -380.30 | Avg(10): -658.61 | Epsilon: 0.347 | Time: 4.77s
Episode 212 | Total Reward: -514.49 | Avg(10): -640.61 | Epsilon: 0.346 | Time: 4.79s
Episode 213 | Total Reward: -625.15 | Avg(10): -665.12 | Epsilon: 0.344 | Time: 4.83s
Episode 214 | Total Reward: -255.55 | Avg(10): -644.88 | Epsilon: 0.342 | Time: 4.82s
Episode 215 | Total Reward: -382.30 | Avg(10): -632.42 | Epsilon: 0.340 | Time: 4.92s
Episode 216 | Total Reward: -499.64 | Avg(10): -643.69 | Epsilon: 0.339 | Time: 4.88s
Episode 217 | Total Reward: -254.24 | Avg(10): -568.92 | Epsilon: 0.337 | Time: 4.81s
Episode 218 | Total Reward: -374.90 | Avg(10): -501.05 | Epsilon: 0.335 | Time: 4.82s
Episode 219 | Total Reward: -374.20 | Avg(10): -447.88 | Epsilon: 0.334 | Time: 4.92s
Episode 220 | Total Reward: -368.13 | Avg(10): -402.89 | Epsilon: 0.332 | Time: 5.61s
Episode 221 | Total Reward: -393.49 | Avg(10): -404.21 | Epsilon: 0.330 | Time: 5.04s
Episode 222 | Total Reward: -376.77 | Avg(10): -390.44 | Epsilon: 0.329 | Time: 4.89s
Episode 223 | Total Reward: -372.97 | Avg(10): -365.22 | Epsilon: 0.327 | Time: 5.56s
Episode 224 | Total Reward: -129.62 | Avg(10): -352.63 | Epsilon: 0.325 | Time: 5.96s
Episode 225 | Total Reward: -618.99 | Avg(10): -376.29 | Epsilon: 0.324 | Time: 6.86s
Episode 226 | Total Reward: -249.18 | Avg(10): -351.25 | Epsilon: 0.322 | Time: 6.90s
Episode 227 | Total Reward: -509.80 | Avg(10): -376.80 | Epsilon: 0.321 | Time: 6.34s
Episode 228 | Total Reward: -853.40 | Avg(10): -424.65 | Epsilon: 0.319 | Time: 6.26s
Episode 229 | Total Reward: -736.92 | Avg(10): -460.93 | Epsilon: 0.317 | Time: 6.19s
Episode 230 | Total Reward: -359.44 | Avg(10): -460.06 | Epsilon: 0.316 | Time: 6.30s
Episode 231 | Total Reward: -374.30 | Avg(10): -458.14 | Epsilon: 0.314 | Time: 6.24s
Episode 232 | Total Reward: -497.29 | Avg(10): -470.19 | Epsilon: 0.313 | Time: 6.05s
Episode 233 | Total Reward: -724.44 | Avg(10): -505.34 | Epsilon: 0.311 | Time: 6.07s
Episode 234 | Total Reward: -353.58 | Avg(10): -527.73 | Epsilon: 0.309 | Time: 5.60s
Episode 235 | Total Reward: -359.29 | Avg(10): -501.76 | Epsilon: 0.308 | Time: 5.79s
Episode 236 | Total Reward: -604.49 | Avg(10): -537.29 | Epsilon: 0.306 | Time: 5.98s
Episode 237 | Total Reward: -255.36 | Avg(10): -511.85 | Epsilon: 0.305 | Time: 6.35s
Episode 238 | Total Reward: -253.90 | Avg(10): -451.90 | Epsilon: 0.303 | Time: 5.64s
Episode 239 | Total Reward: -354.33 | Avg(10): -413.64 | Epsilon: 0.302 | Time: 5.93s
Episode 240 | Total Reward: -244.07 | Avg(10): -402.11 | Epsilon: 0.300 | Time: 5.80s
Episode 241 | Total Reward: -126.80 | Avg(10): -377.36 | Epsilon: 0.299 | Time: 6.06s
Episode 242 | Total Reward: -125.71 | Avg(10): -340.20 | Epsilon: 0.297 | Time: 6.17s
Episode 243 | Total Reward: -302.97 | Avg(10): -298.05 | Epsilon: 0.296 | Time: 5.89s
Episode 244 | Total Reward: -365.30 | Avg(10): -299.22 | Epsilon: 0.294 | Time: 6.18s
Episode 245 | Total Reward: -2.12 | Avg(10): -263.51 | Epsilon: 0.293 | Time: 5.84s
Episode 246 | Total Reward: -424.04 | Avg(10): -245.46 | Epsilon: 0.291 | Time: 6.03s
Episode 247 | Total Reward: -252.76 | Avg(10): -245.20 | Epsilon: 0.290 | Time: 6.31s
Episode 248 | Total Reward: -640.57 | Avg(10): -283.87 | Epsilon: 0.288 | Time: 6.37s
Episode 249 | Total Reward: -365.62 | Avg(10): -285.00 | Epsilon: 0.287 | Time: 6.16s
Episode 250 | Total Reward: -238.89 | Avg(10): -284.48 | Epsilon: 0.286 | Time: 6.21s
Episode 251 | Total Reward: -248.52 | Avg(10): -296.65 | Epsilon: 0.284 | Time: 5.92s
Episode 252 | Total Reward: -364.06 | Avg(10): -320.49 | Epsilon: 0.283 | Time: 6.06s
Episode 253 | Total Reward: -363.57 | Avg(10): -326.55 | Epsilon: 0.281 | Time: 6.21s
Episode 254 | Total Reward: -626.58 | Avg(10): -352.67 | Epsilon: 0.280 | Time: 6.08s
Episode 255 | Total Reward: -507.24 | Avg(10): -403.19 | Epsilon: 0.279 | Time: 6.28s
Episode 256 | Total Reward: -343.90 | Avg(10): -395.17 | Epsilon: 0.277 | Time: 6.41s
Episode 257 | Total Reward: -127.98 | Avg(10): -382.69 | Epsilon: 0.276 | Time: 6.37s
Episode 258 | Total Reward: -244.22 | Avg(10): -343.06 | Epsilon: 0.274 | Time: 6.30s
Episode 259 | Total Reward: -526.28 | Avg(10): -359.12 | Epsilon: 0.273 | Time: 6.34s
Episode 260 | Total Reward: -245.13 | Avg(10): -359.75 | Epsilon: 0.272 | Time: 6.48s
Episode 261 | Total Reward: -125.59 | Avg(10): -347.46 | Epsilon: 0.270 | Time: 6.37s
Episode 262 | Total Reward: -622.16 | Avg(10): -373.27 | Epsilon: 0.269 | Time: 6.43s
Episode 263 | Total Reward: -128.47 | Avg(10): -349.76 | Epsilon: 0.268 | Time: 6.40s
Episode 264 | Total Reward: -2.21 | Avg(10): -287.32 | Epsilon: 0.266 | Time: 6.26s
Episode 265 | Total Reward: -3.87 | Avg(10): -236.98 | Epsilon: 0.265 | Time: 6.30s
Episode 266 | Total Reward: -542.43 | Avg(10): -256.83 | Epsilon: 0.264 | Time: 6.33s
Episode 267 | Total Reward: -125.55 | Avg(10): -256.59 | Epsilon: 0.262 | Time: 7.28s
Episode 268 | Total Reward: -371.24 | Avg(10): -269.29 | Epsilon: 0.261 | Time: 6.24s
Episode 269 | Total Reward: -128.19 | Avg(10): -229.48 | Epsilon: 0.260 | Time: 6.39s
Episode 270 | Total Reward: -236.45 | Avg(10): -228.62 | Epsilon: 0.258 | Time: 5.89s
Episode 271 | Total Reward: -393.20 | Avg(10): -255.38 | Epsilon: 0.257 | Time: 5.94s
Episode 272 | Total Reward: -123.42 | Avg(10): -205.50 | Epsilon: 0.256 | Time: 5.92s
Episode 273 | Total Reward: -248.09 | Avg(10): -217.46 | Epsilon: 0.255 | Time: 6.16s
Episode 274 | Total Reward: -1.16 | Avg(10): -217.36 | Epsilon: 0.253 | Time: 6.05s
Episode 275 | Total Reward: -1.22 | Avg(10): -217.09 | Epsilon: 0.252 | Time: 6.45s
Episode 276 | Total Reward: -126.09 | Avg(10): -175.46 | Epsilon: 0.251 | Time: 5.59s
Episode 277 | Total Reward: -259.26 | Avg(10): -188.83 | Epsilon: 0.249 | Time: 5.53s
Episode 278 | Total Reward: -460.73 | Avg(10): -197.78 | Epsilon: 0.248 | Time: 5.42s
Episode 279 | Total Reward: -126.04 | Avg(10): -197.57 | Epsilon: 0.247 | Time: 5.47s
Episode 280 | Total Reward: -243.04 | Avg(10): -198.22 | Epsilon: 0.246 | Time: 5.55s
Episode 281 | Total Reward: -124.47 | Avg(10): -171.35 | Epsilon: 0.245 | Time: 5.48s
Episode 282 | Total Reward: -254.92 | Avg(10): -184.50 | Epsilon: 0.243 | Time: 5.80s
Episode 283 | Total Reward: -370.47 | Avg(10): -196.74 | Epsilon: 0.242 | Time: 5.78s
Episode 284 | Total Reward: -245.85 | Avg(10): -221.21 | Epsilon: 0.241 | Time: 5.74s
Episode 285 | Total Reward: -126.65 | Avg(10): -233.75 | Epsilon: 0.240 | Time: 5.25s
Episode 286 | Total Reward: -2.22 | Avg(10): -221.37 | Epsilon: 0.238 | Time: 5.42s
Episode 287 | Total Reward: -126.91 | Avg(10): -208.13 | Epsilon: 0.237 | Time: 8.26s
Episode 288 | Total Reward: -121.78 | Avg(10): -174.23 | Epsilon: 0.236 | Time: 5.95s
Episode 289 | Total Reward: -122.75 | Avg(10): -173.90 | Epsilon: 0.235 | Time: 5.27s
Episode 290 | Total Reward: -246.39 | Avg(10): -174.24 | Epsilon: 0.234 | Time: 5.66s
Episode 291 | Total Reward: -601.38 | Avg(10): -221.93 | Epsilon: 0.233 | Time: 5.51s
Episode 292 | Total Reward: -373.33 | Avg(10): -233.77 | Epsilon: 0.231 | Time: 5.32s
Episode 293 | Total Reward: -1.44 | Avg(10): -196.87 | Epsilon: 0.230 | Time: 5.33s
Episode 294 | Total Reward: -122.89 | Avg(10): -184.57 | Epsilon: 0.229 | Time: 5.29s
Episode 295 | Total Reward: -449.02 | Avg(10): -216.81 | Epsilon: 0.228 | Time: 5.28s
Episode 296 | Total Reward: -124.63 | Avg(10): -229.05 | Epsilon: 0.227 | Time: 5.27s
Episode 297 | Total Reward: -246.33 | Avg(10): -240.99 | Epsilon: 0.226 | Time: 5.36s
Episode 298 | Total Reward: -126.33 | Avg(10): -241.45 | Epsilon: 0.225 | Time: 5.39s
Episode 299 | Total Reward: -492.90 | Avg(10): -278.47 | Epsilon: 0.223 | Time: 5.35s

--- Episode 300: Action Usage Analysis ---
Action distribution: [0.05125 0.02545 0.0338  0.2082  0.0306  0.0235  0.02095 0.02365 0.0206
 0.027   0.03625 0.03925 0.0363  0.0248  0.02055 0.0299  0.01895 0.02305
 0.02425 0.0311  0.2506 ]
Entropy (diversity): 2.580
--------------------------------------------------
Episode 300 | Total Reward: -127.14 | Avg(10): -266.54 | Epsilon: 0.222 | Time: 5.42s
Episode 301 | Total Reward: -122.58 | Avg(10): -218.66 | Epsilon: 0.221 | Time: 5.32s
Episode 302 | Total Reward: -124.69 | Avg(10): -193.80 | Epsilon: 0.220 | Time: 5.16s
Episode 303 | Total Reward: -251.13 | Avg(10): -218.77 | Epsilon: 0.219 | Time: 5.26s
Episode 304 | Total Reward: -517.69 | Avg(10): -258.25 | Epsilon: 0.218 | Time: 5.49s
Episode 305 | Total Reward: -245.17 | Avg(10): -237.86 | Epsilon: 0.217 | Time: 5.20s
Episode 306 | Total Reward: -126.41 | Avg(10): -238.04 | Epsilon: 0.216 | Time: 5.33s
Episode 307 | Total Reward: -239.26 | Avg(10): -237.33 | Epsilon: 0.215 | Time: 5.27s
Episode 308 | Total Reward: -512.47 | Avg(10): -275.95 | Epsilon: 0.214 | Time: 5.50s
Episode 309 | Total Reward: -637.44 | Avg(10): -290.40 | Epsilon: 0.212 | Time: 5.26s
Episode 310 | Total Reward: -124.78 | Avg(10): -290.16 | Epsilon: 0.211 | Time: 5.38s
Episode 311 | Total Reward: -240.37 | Avg(10): -301.94 | Epsilon: 0.210 | Time: 5.42s
Episode 312 | Total Reward: -1.74 | Avg(10): -289.65 | Epsilon: 0.209 | Time: 5.28s
Episode 313 | Total Reward: -362.70 | Avg(10): -300.80 | Epsilon: 0.208 | Time: 5.33s
Episode 314 | Total Reward: -120.26 | Avg(10): -261.06 | Epsilon: 0.207 | Time: 5.43s
Episode 315 | Total Reward: -125.04 | Avg(10): -249.05 | Epsilon: 0.206 | Time: 5.48s
Episode 316 | Total Reward: -122.84 | Avg(10): -248.69 | Epsilon: 0.205 | Time: 5.62s
Episode 317 | Total Reward: -503.15 | Avg(10): -275.08 | Epsilon: 0.204 | Time: 5.53s
Episode 318 | Total Reward: -128.80 | Avg(10): -236.71 | Epsilon: 0.203 | Time: 5.43s
Episode 319 | Total Reward: -117.45 | Avg(10): -184.71 | Epsilon: 0.202 | Time: 5.41s
Episode 320 | Total Reward: -124.18 | Avg(10): -184.65 | Epsilon: 0.201 | Time: 5.31s
Episode 321 | Total Reward: -332.22 | Avg(10): -193.84 | Epsilon: 0.200 | Time: 5.34s
Episode 322 | Total Reward: -125.41 | Avg(10): -206.21 | Epsilon: 0.199 | Time: 5.46s
Episode 323 | Total Reward: -124.14 | Avg(10): -182.35 | Epsilon: 0.198 | Time: 5.44s
Episode 324 | Total Reward: -238.13 | Avg(10): -194.14 | Epsilon: 0.197 | Time: 5.23s
Episode 325 | Total Reward: -122.44 | Avg(10): -193.88 | Epsilon: 0.196 | Time: 5.26s
Episode 326 | Total Reward: -240.61 | Avg(10): -205.65 | Epsilon: 0.195 | Time: 5.23s
Episode 327 | Total Reward: -468.90 | Avg(10): -202.23 | Epsilon: 0.194 | Time: 5.26s
Episode 328 | Total Reward: -118.81 | Avg(10): -201.23 | Epsilon: 0.193 | Time: 5.17s
Episode 329 | Total Reward: -254.93 | Avg(10): -214.98 | Epsilon: 0.192 | Time: 5.24s
Episode 330 | Total Reward: -124.77 | Avg(10): -215.04 | Epsilon: 0.191 | Time: 5.35s
Episode 331 | Total Reward: -356.33 | Avg(10): -217.45 | Epsilon: 0.190 | Time: 5.45s
Episode 332 | Total Reward: -445.36 | Avg(10): -249.44 | Epsilon: 0.189 | Time: 5.22s
Episode 333 | Total Reward: -1.32 | Avg(10): -237.16 | Epsilon: 0.188 | Time: 7.34s
Episode 334 | Total Reward: -2.09 | Avg(10): -213.56 | Epsilon: 0.187 | Time: 5.36s
Episode 335 | Total Reward: -122.94 | Avg(10): -213.61 | Epsilon: 0.187 | Time: 4.98s
Episode 336 | Total Reward: -349.99 | Avg(10): -224.54 | Epsilon: 0.186 | Time: 4.99s
Episode 337 | Total Reward: -121.36 | Avg(10): -189.79 | Epsilon: 0.185 | Time: 5.21s
Episode 338 | Total Reward: -355.15 | Avg(10): -213.42 | Epsilon: 0.184 | Time: 6.34s
Episode 339 | Total Reward: -424.65 | Avg(10): -230.40 | Epsilon: 0.183 | Time: 5.59s
Episode 340 | Total Reward: -124.86 | Avg(10): -230.41 | Epsilon: 0.182 | Time: 5.81s
Episode 341 | Total Reward: -124.42 | Avg(10): -207.21 | Epsilon: 0.181 | Time: 5.59s
Episode 342 | Total Reward: -127.12 | Avg(10): -175.39 | Epsilon: 0.180 | Time: 6.32s
Episode 343 | Total Reward: -121.53 | Avg(10): -187.41 | Epsilon: 0.179 | Time: 6.75s
Episode 344 | Total Reward: -247.07 | Avg(10): -211.91 | Epsilon: 0.178 | Time: 5.37s
Episode 345 | Total Reward: -127.61 | Avg(10): -212.38 | Epsilon: 0.177 | Time: 5.62s
Episode 346 | Total Reward: -1.05 | Avg(10): -177.48 | Epsilon: 0.177 | Time: 5.48s
Episode 347 | Total Reward: -123.98 | Avg(10): -177.74 | Epsilon: 0.176 | Time: 5.75s
Episode 348 | Total Reward: -239.38 | Avg(10): -166.17 | Epsilon: 0.175 | Time: 5.46s
Episode 349 | Total Reward: -246.62 | Avg(10): -148.36 | Epsilon: 0.174 | Time: 5.55s
Episode 350 | Total Reward: -1.98 | Avg(10): -136.08 | Epsilon: 0.173 | Time: 5.45s
Episode 351 | Total Reward: -116.09 | Avg(10): -135.24 | Epsilon: 0.172 | Time: 5.75s
Episode 352 | Total Reward: -227.30 | Avg(10): -145.26 | Epsilon: 0.171 | Time: 5.56s
Episode 353 | Total Reward: -369.33 | Avg(10): -170.04 | Epsilon: 0.170 | Time: 5.54s
Episode 354 | Total Reward: -121.66 | Avg(10): -157.50 | Epsilon: 0.170 | Time: 5.50s
Episode 355 | Total Reward: -244.07 | Avg(10): -169.15 | Epsilon: 0.169 | Time: 5.43s
Episode 356 | Total Reward: -493.90 | Avg(10): -218.43 | Epsilon: 0.168 | Time: 5.67s
Episode 357 | Total Reward: -124.44 | Avg(10): -218.48 | Epsilon: 0.167 | Time: 5.28s
Episode 358 | Total Reward: -119.15 | Avg(10): -206.45 | Epsilon: 0.166 | Time: 5.53s
Episode 359 | Total Reward: -227.48 | Avg(10): -204.54 | Epsilon: 0.165 | Time: 5.43s
Episode 360 | Total Reward: -244.95 | Avg(10): -228.84 | Epsilon: 0.165 | Time: 5.33s
Episode 361 | Total Reward: -253.36 | Avg(10): -242.56 | Epsilon: 0.164 | Time: 5.41s
Episode 362 | Total Reward: -123.87 | Avg(10): -232.22 | Epsilon: 0.163 | Time: 5.32s
Episode 363 | Total Reward: -121.96 | Avg(10): -207.48 | Epsilon: 0.162 | Time: 5.50s
Episode 364 | Total Reward: -244.35 | Avg(10): -219.75 | Epsilon: 0.161 | Time: 5.31s
Episode 365 | Total Reward: -122.95 | Avg(10): -207.64 | Epsilon: 0.160 | Time: 5.66s
Episode 366 | Total Reward: -118.79 | Avg(10): -170.13 | Epsilon: 0.160 | Time: 5.46s
Episode 367 | Total Reward: -118.94 | Avg(10): -169.58 | Epsilon: 0.159 | Time: 5.30s
Episode 368 | Total Reward: -445.05 | Avg(10): -202.17 | Epsilon: 0.158 | Time: 5.20s
Episode 369 | Total Reward: -124.01 | Avg(10): -191.82 | Epsilon: 0.157 | Time: 5.25s
Episode 370 | Total Reward: -241.85 | Avg(10): -191.51 | Epsilon: 0.157 | Time: 5.27s
Episode 371 | Total Reward: -125.37 | Avg(10): -178.71 | Epsilon: 0.156 | Time: 5.36s
Episode 372 | Total Reward: -126.79 | Avg(10): -179.01 | Epsilon: 0.155 | Time: 5.27s
Episode 373 | Total Reward: -240.37 | Avg(10): -190.85 | Epsilon: 0.154 | Time: 5.17s
Episode 374 | Total Reward: -341.26 | Avg(10): -200.54 | Epsilon: 0.153 | Time: 5.26s
Episode 375 | Total Reward: -123.73 | Avg(10): -200.62 | Epsilon: 0.153 | Time: 5.23s
Episode 376 | Total Reward: -123.39 | Avg(10): -201.08 | Epsilon: 0.152 | Time: 5.30s
Episode 377 | Total Reward: -471.76 | Avg(10): -236.36 | Epsilon: 0.151 | Time: 5.33s
Episode 378 | Total Reward: -259.30 | Avg(10): -217.78 | Epsilon: 0.150 | Time: 5.34s
Episode 379 | Total Reward: -1.89 | Avg(10): -205.57 | Epsilon: 0.150 | Time: 5.28s
Episode 380 | Total Reward: -239.49 | Avg(10): -205.34 | Epsilon: 0.149 | Time: 5.19s
Episode 381 | Total Reward: -235.12 | Avg(10): -216.31 | Epsilon: 0.148 | Time: 5.31s
Episode 382 | Total Reward: -120.84 | Avg(10): -215.71 | Epsilon: 0.147 | Time: 5.32s
Episode 383 | Total Reward: -120.53 | Avg(10): -203.73 | Epsilon: 0.147 | Time: 5.30s
Episode 384 | Total Reward: -114.54 | Avg(10): -181.06 | Epsilon: 0.146 | Time: 5.34s
Episode 385 | Total Reward: -254.62 | Avg(10): -194.15 | Epsilon: 0.145 | Time: 5.42s
Episode 386 | Total Reward: -114.68 | Avg(10): -193.28 | Epsilon: 0.144 | Time: 5.41s
Episode 387 | Total Reward: -229.82 | Avg(10): -169.08 | Epsilon: 0.144 | Time: 5.30s
Episode 388 | Total Reward: -270.10 | Avg(10): -170.16 | Epsilon: 0.143 | Time: 5.35s
Episode 389 | Total Reward: -124.16 | Avg(10): -182.39 | Epsilon: 0.142 | Time: 5.31s
Episode 390 | Total Reward: -126.91 | Avg(10): -171.13 | Epsilon: 0.142 | Time: 5.34s
Episode 391 | Total Reward: -251.68 | Avg(10): -172.79 | Epsilon: 0.141 | Time: 5.43s
Episode 392 | Total Reward: -348.41 | Avg(10): -195.55 | Epsilon: 0.140 | Time: 5.59s
Episode 393 | Total Reward: -121.63 | Avg(10): -195.66 | Epsilon: 0.139 | Time: 5.34s
Episode 394 | Total Reward: -342.15 | Avg(10): -218.42 | Epsilon: 0.139 | Time: 5.37s
Episode 395 | Total Reward: -119.68 | Avg(10): -204.92 | Epsilon: 0.138 | Time: 5.40s
Episode 396 | Total Reward: -122.96 | Avg(10): -205.75 | Epsilon: 0.137 | Time: 5.32s
Episode 397 | Total Reward: -236.21 | Avg(10): -206.39 | Epsilon: 0.137 | Time: 5.32s
Episode 398 | Total Reward: -1.09 | Avg(10): -179.49 | Epsilon: 0.136 | Time: 5.52s
Episode 399 | Total Reward: -235.57 | Avg(10): -190.63 | Epsilon: 0.135 | Time: 5.42s

--- Episode 400: Action Usage Analysis ---
Action distribution: [0.0635  0.04065 0.03825 0.0263  0.03205 0.04055 0.04675 0.04215 0.04885
 0.05565 0.04655 0.04385 0.03775 0.048   0.043   0.0499  0.0463  0.0469
 0.0558  0.0632  0.08405]
Entropy (diversity): 3.015
--------------------------------------------------
Episode 400 | Total Reward: -225.84 | Avg(10): -200.52 | Epsilon: 0.135 | Time: 5.36s
Episode 401 | Total Reward: -604.97 | Avg(10): -235.85 | Epsilon: 0.134 | Time: 5.31s
Episode 402 | Total Reward: -243.06 | Avg(10): -225.32 | Epsilon: 0.133 | Time: 5.37s
Episode 403 | Total Reward: -128.12 | Avg(10): -225.97 | Epsilon: 0.133 | Time: 5.32s
Episode 404 | Total Reward: -256.85 | Avg(10): -217.44 | Epsilon: 0.132 | Time: 5.44s
Episode 405 | Total Reward: -0.89 | Avg(10): -205.56 | Epsilon: 0.131 | Time: 5.54s
Episode 406 | Total Reward: -252.36 | Avg(10): -218.50 | Epsilon: 0.131 | Time: 5.26s
Episode 407 | Total Reward: -491.40 | Avg(10): -244.02 | Epsilon: 0.130 | Time: 5.32s
Episode 408 | Total Reward: -244.83 | Avg(10): -268.39 | Epsilon: 0.129 | Time: 5.32s
Episode 409 | Total Reward: -124.73 | Avg(10): -257.30 | Epsilon: 0.129 | Time: 5.47s
Episode 410 | Total Reward: -1.06 | Avg(10): -234.83 | Epsilon: 0.128 | Time: 5.53s
Episode 411 | Total Reward: -118.58 | Avg(10): -186.19 | Epsilon: 0.127 | Time: 5.38s
Episode 412 | Total Reward: -249.98 | Avg(10): -186.88 | Epsilon: 0.127 | Time: 5.36s
Episode 413 | Total Reward: -116.86 | Avg(10): -185.75 | Epsilon: 0.126 | Time: 5.31s
Episode 414 | Total Reward: -370.95 | Avg(10): -197.16 | Epsilon: 0.126 | Time: 5.23s
Episode 415 | Total Reward: -243.78 | Avg(10): -221.45 | Epsilon: 0.125 | Time: 5.18s
Episode 416 | Total Reward: -271.36 | Avg(10): -223.35 | Epsilon: 0.124 | Time: 5.26s
Episode 417 | Total Reward: -124.08 | Avg(10): -186.62 | Epsilon: 0.124 | Time: 5.20s
Episode 418 | Total Reward: -252.68 | Avg(10): -187.41 | Epsilon: 0.123 | Time: 5.61s
Episode 419 | Total Reward: -123.58 | Avg(10): -187.29 | Epsilon: 0.122 | Time: 5.36s
Episode 420 | Total Reward: -225.52 | Avg(10): -209.74 | Epsilon: 0.122 | Time: 5.38s
Episode 421 | Total Reward: -115.25 | Avg(10): -209.40 | Epsilon: 0.121 | Time: 5.41s
Episode 422 | Total Reward: -409.20 | Avg(10): -225.33 | Epsilon: 0.121 | Time: 5.23s
Episode 423 | Total Reward: -2.00 | Avg(10): -213.84 | Epsilon: 0.120 | Time: 5.43s
Episode 424 | Total Reward: -124.66 | Avg(10): -189.21 | Epsilon: 0.119 | Time: 5.23s
Episode 425 | Total Reward: -236.23 | Avg(10): -188.46 | Epsilon: 0.119 | Time: 5.42s
Episode 426 | Total Reward: -234.63 | Avg(10): -184.78 | Epsilon: 0.118 | Time: 5.24s
Episode 427 | Total Reward: -232.34 | Avg(10): -195.61 | Epsilon: 0.118 | Time: 5.32s
Episode 428 | Total Reward: -122.48 | Avg(10): -182.59 | Epsilon: 0.117 | Time: 5.22s
Episode 429 | Total Reward: -1.04 | Avg(10): -170.34 | Epsilon: 0.116 | Time: 5.33s
Episode 430 | Total Reward: -122.03 | Avg(10): -159.99 | Epsilon: 0.116 | Time: 5.41s
Episode 431 | Total Reward: -117.47 | Avg(10): -160.21 | Epsilon: 0.115 | Time: 5.33s
Episode 432 | Total Reward: -119.69 | Avg(10): -131.26 | Epsilon: 0.115 | Time: 5.52s
Episode 433 | Total Reward: -374.98 | Avg(10): -168.55 | Epsilon: 0.114 | Time: 5.40s
Episode 434 | Total Reward: -125.97 | Avg(10): -168.69 | Epsilon: 0.114 | Time: 5.35s
Episode 435 | Total Reward: -117.98 | Avg(10): -156.86 | Epsilon: 0.113 | Time: 5.25s
Episode 436 | Total Reward: -237.48 | Avg(10): -157.15 | Epsilon: 0.112 | Time: 5.29s
Episode 437 | Total Reward: -120.94 | Avg(10): -146.01 | Epsilon: 0.112 | Time: 5.28s
Episode 438 | Total Reward: -227.49 | Avg(10): -156.51 | Epsilon: 0.111 | Time: 6.30s
Episode 439 | Total Reward: -121.12 | Avg(10): -168.51 | Epsilon: 0.111 | Time: 5.66s
Episode 440 | Total Reward: -0.85 | Avg(10): -156.40 | Epsilon: 0.110 | Time: 5.75s
Episode 441 | Total Reward: -477.07 | Avg(10): -192.36 | Epsilon: 0.110 | Time: 5.55s
Episode 442 | Total Reward: -124.08 | Avg(10): -192.80 | Epsilon: 0.109 | Time: 5.48s
Episode 443 | Total Reward: -122.86 | Avg(10): -167.58 | Epsilon: 0.109 | Time: 5.54s
Episode 444 | Total Reward: -119.12 | Avg(10): -166.90 | Epsilon: 0.108 | Time: 5.46s
Episode 445 | Total Reward: -118.56 | Avg(10): -166.96 | Epsilon: 0.107 | Time: 5.41s
Episode 446 | Total Reward: -368.87 | Avg(10): -180.10 | Epsilon: 0.107 | Time: 5.41s
Episode 447 | Total Reward: -241.63 | Avg(10): -192.17 | Epsilon: 0.106 | Time: 5.91s
Episode 448 | Total Reward: -126.38 | Avg(10): -182.05 | Epsilon: 0.106 | Time: 7.02s
Episode 449 | Total Reward: -0.55 | Avg(10): -170.00 | Epsilon: 0.105 | Time: 5.67s
Episode 450 | Total Reward: -244.85 | Avg(10): -194.40 | Epsilon: 0.105 | Time: 5.46s
Episode 451 | Total Reward: -0.77 | Avg(10): -146.77 | Epsilon: 0.104 | Time: 6.12s
Episode 452 | Total Reward: -401.35 | Avg(10): -174.49 | Epsilon: 0.104 | Time: 6.71s
Episode 453 | Total Reward: -122.83 | Avg(10): -174.49 | Epsilon: 0.103 | Time: 5.93s
Episode 454 | Total Reward: -469.62 | Avg(10): -209.54 | Epsilon: 0.103 | Time: 5.87s
Episode 455 | Total Reward: -117.55 | Avg(10): -209.44 | Epsilon: 0.102 | Time: 5.42s
Episode 456 | Total Reward: -1.85 | Avg(10): -172.74 | Epsilon: 0.102 | Time: 5.82s
Episode 457 | Total Reward: -245.54 | Avg(10): -173.13 | Epsilon: 0.101 | Time: 5.63s
Episode 458 | Total Reward: -117.05 | Avg(10): -172.20 | Epsilon: 0.101 | Time: 5.62s
Episode 459 | Total Reward: -245.35 | Avg(10): -196.68 | Epsilon: 0.100 | Time: 5.70s
Episode 460 | Total Reward: -120.14 | Avg(10): -184.20 | Epsilon: 0.100 | Time: 5.27s
Episode 461 | Total Reward: -456.00 | Avg(10): -229.73 | Epsilon: 0.099 | Time: 6.18s
Episode 462 | Total Reward: -121.75 | Avg(10): -201.77 | Epsilon: 0.099 | Time: 6.05s
Episode 463 | Total Reward: -236.01 | Avg(10): -213.09 | Epsilon: 0.098 | Time: 6.24s
Episode 464 | Total Reward: -122.05 | Avg(10): -178.33 | Epsilon: 0.098 | Time: 7.34s
Episode 465 | Total Reward: -228.36 | Avg(10): -189.41 | Epsilon: 0.097 | Time: 6.99s
Episode 466 | Total Reward: -121.45 | Avg(10): -201.37 | Epsilon: 0.097 | Time: 5.38s
Episode 467 | Total Reward: -0.86 | Avg(10): -176.90 | Epsilon: 0.096 | Time: 5.49s
Episode 468 | Total Reward: -1.22 | Avg(10): -165.32 | Epsilon: 0.096 | Time: 5.09s
Episode 469 | Total Reward: -120.58 | Avg(10): -152.84 | Epsilon: 0.095 | Time: 5.31s
Episode 470 | Total Reward: -307.73 | Avg(10): -171.60 | Epsilon: 0.095 | Time: 5.01s
Episode 471 | Total Reward: -125.97 | Avg(10): -138.60 | Epsilon: 0.094 | Time: 5.03s
Episode 472 | Total Reward: -416.42 | Avg(10): -168.07 | Epsilon: 0.094 | Time: 5.46s
Episode 473 | Total Reward: -230.76 | Avg(10): -167.54 | Epsilon: 0.093 | Time: 8.20s
Episode 474 | Total Reward: -2.16 | Avg(10): -155.55 | Epsilon: 0.093 | Time: 8.95s
Episode 475 | Total Reward: -385.95 | Avg(10): -171.31 | Epsilon: 0.092 | Time: 8.86s
Episode 476 | Total Reward: -124.69 | Avg(10): -171.63 | Epsilon: 0.092 | Time: 8.68s
Episode 477 | Total Reward: -116.69 | Avg(10): -183.22 | Epsilon: 0.092 | Time: 8.45s
Episode 478 | Total Reward: -125.32 | Avg(10): -195.63 | Epsilon: 0.091 | Time: 8.71s
Episode 479 | Total Reward: -245.67 | Avg(10): -208.14 | Epsilon: 0.091 | Time: 8.36s
Episode 480 | Total Reward: -283.40 | Avg(10): -205.70 | Epsilon: 0.090 | Time: 8.66s
Episode 481 | Total Reward: -247.32 | Avg(10): -217.84 | Epsilon: 0.090 | Time: 9.17s
Episode 482 | Total Reward: -124.01 | Avg(10): -188.60 | Epsilon: 0.089 | Time: 8.45s
Episode 483 | Total Reward: -125.49 | Avg(10): -178.07 | Epsilon: 0.089 | Time: 8.41s
Episode 484 | Total Reward: -231.47 | Avg(10): -201.00 | Epsilon: 0.088 | Time: 8.39s
Episode 485 | Total Reward: -123.05 | Avg(10): -174.71 | Epsilon: 0.088 | Time: 8.38s
Episode 486 | Total Reward: -123.93 | Avg(10): -174.64 | Epsilon: 0.088 | Time: 8.11s
Episode 487 | Total Reward: -235.54 | Avg(10): -186.52 | Epsilon: 0.087 | Time: 8.43s
Episode 488 | Total Reward: -235.10 | Avg(10): -197.50 | Epsilon: 0.087 | Time: 8.23s
Episode 489 | Total Reward: -121.88 | Avg(10): -185.12 | Epsilon: 0.086 | Time: 8.44s
Episode 490 | Total Reward: -1.25 | Avg(10): -156.90 | Epsilon: 0.086 | Time: 8.59s
Episode 491 | Total Reward: -365.65 | Avg(10): -168.74 | Epsilon: 0.085 | Time: 8.29s
Episode 492 | Total Reward: -121.74 | Avg(10): -168.51 | Epsilon: 0.085 | Time: 8.39s
Episode 493 | Total Reward: -269.96 | Avg(10): -182.96 | Epsilon: 0.084 | Time: 8.69s
Episode 494 | Total Reward: -257.26 | Avg(10): -185.54 | Epsilon: 0.084 | Time: 8.26s
Episode 495 | Total Reward: -122.85 | Avg(10): -185.52 | Epsilon: 0.084 | Time: 8.75s
Episode 496 | Total Reward: -1.28 | Avg(10): -173.25 | Epsilon: 0.083 | Time: 8.38s
Episode 497 | Total Reward: -126.04 | Avg(10): -162.30 | Epsilon: 0.083 | Time: 8.46s
Episode 498 | Total Reward: -1.25 | Avg(10): -138.92 | Epsilon: 0.082 | Time: 8.43s
Episode 499 | Total Reward: -248.98 | Avg(10): -151.63 | Epsilon: 0.082 | Time: 8.34s

--- Episode 500: Action Usage Analysis ---
Action distribution: [0.05255 0.04575 0.03335 0.03225 0.0362  0.04515 0.04435 0.0542  0.05245
 0.06345 0.05755 0.04    0.04845 0.0439  0.0508  0.046   0.0446  0.0514
 0.0632  0.05525 0.03915]
Entropy (diversity): 3.028
--------------------------------------------------
Episode 500 | Total Reward: -350.04 | Avg(10): -186.50 | Epsilon: 0.082 | Time: 8.35s
Episode 501 | Total Reward: -349.78 | Avg(10): -184.92 | Epsilon: 0.081 | Time: 8.30s
Episode 502 | Total Reward: -121.33 | Avg(10): -184.88 | Epsilon: 0.081 | Time: 8.76s
Episode 503 | Total Reward: -117.55 | Avg(10): -169.64 | Epsilon: 0.080 | Time: 8.97s
Episode 504 | Total Reward: -129.71 | Avg(10): -156.88 | Epsilon: 0.080 | Time: 8.88s
Episode 505 | Total Reward: -127.84 | Avg(10): -157.38 | Epsilon: 0.080 | Time: 8.89s
Episode 506 | Total Reward: -123.96 | Avg(10): -169.65 | Epsilon: 0.079 | Time: 8.92s
Episode 507 | Total Reward: -123.79 | Avg(10): -169.42 | Epsilon: 0.079 | Time: 9.08s
Episode 508 | Total Reward: -124.34 | Avg(10): -181.73 | Epsilon: 0.078 | Time: 9.00s
Episode 509 | Total Reward: -121.93 | Avg(10): -169.02 | Epsilon: 0.078 | Time: 8.86s
Episode 510 | Total Reward: -0.63 | Avg(10): -134.08 | Epsilon: 0.078 | Time: 8.95s
Episode 511 | Total Reward: -124.25 | Avg(10): -111.53 | Epsilon: 0.077 | Time: 8.77s
Episode 512 | Total Reward: -123.29 | Avg(10): -111.73 | Epsilon: 0.077 | Time: 8.80s
Episode 513 | Total Reward: -115.52 | Avg(10): -111.52 | Epsilon: 0.076 | Time: 8.90s
Episode 514 | Total Reward: -231.21 | Avg(10): -121.67 | Epsilon: 0.076 | Time: 8.72s
Episode 515 | Total Reward: -123.07 | Avg(10): -121.20 | Epsilon: 0.076 | Time: 8.77s
Episode 516 | Total Reward: -117.03 | Avg(10): -120.51 | Epsilon: 0.075 | Time: 8.85s
Episode 517 | Total Reward: -120.45 | Avg(10): -120.17 | Epsilon: 0.075 | Time: 9.11s
Episode 518 | Total Reward: -366.06 | Avg(10): -144.34 | Epsilon: 0.075 | Time: 8.95s
Episode 519 | Total Reward: -360.45 | Avg(10): -168.20 | Epsilon: 0.074 | Time: 8.84s
Episode 520 | Total Reward: -245.91 | Avg(10): -192.72 | Epsilon: 0.074 | Time: 8.97s
Episode 521 | Total Reward: -120.51 | Avg(10): -192.35 | Epsilon: 0.073 | Time: 9.11s
Episode 522 | Total Reward: -235.87 | Avg(10): -203.61 | Epsilon: 0.073 | Time: 9.21s
Episode 523 | Total Reward: -120.70 | Avg(10): -204.13 | Epsilon: 0.073 | Time: 8.98s
Episode 524 | Total Reward: -358.15 | Avg(10): -216.82 | Epsilon: 0.072 | Time: 9.23s
Episode 525 | Total Reward: -124.77 | Avg(10): -216.99 | Epsilon: 0.072 | Time: 9.05s
Episode 526 | Total Reward: -367.83 | Avg(10): -242.07 | Epsilon: 0.072 | Time: 9.11s
Episode 527 | Total Reward: -1.77 | Avg(10): -230.20 | Epsilon: 0.071 | Time: 8.94s
Episode 528 | Total Reward: -294.26 | Avg(10): -223.02 | Epsilon: 0.071 | Time: 8.77s
Episode 529 | Total Reward: -122.51 | Avg(10): -199.23 | Epsilon: 0.071 | Time: 9.28s
Episode 530 | Total Reward: -114.67 | Avg(10): -186.10 | Epsilon: 0.070 | Time: 8.98s
Episode 531 | Total Reward: -115.59 | Avg(10): -185.61 | Epsilon: 0.070 | Time: 9.50s
Episode 532 | Total Reward: -303.17 | Avg(10): -192.34 | Epsilon: 0.069 | Time: 9.45s
Episode 533 | Total Reward: -233.20 | Avg(10): -203.59 | Epsilon: 0.069 | Time: 8.87s
Episode 534 | Total Reward: -125.75 | Avg(10): -180.35 | Epsilon: 0.069 | Time: 8.77s
Episode 535 | Total Reward: -241.14 | Avg(10): -191.99 | Epsilon: 0.068 | Time: 8.79s
Episode 536 | Total Reward: -125.99 | Avg(10): -167.81 | Epsilon: 0.068 | Time: 8.80s
Episode 537 | Total Reward: -119.44 | Avg(10): -179.57 | Epsilon: 0.068 | Time: 8.54s
Episode 538 | Total Reward: -352.43 | Avg(10): -185.39 | Epsilon: 0.067 | Time: 8.59s
Episode 539 | Total Reward: -118.78 | Avg(10): -185.02 | Epsilon: 0.067 | Time: 8.62s
Episode 540 | Total Reward: -231.56 | Avg(10): -196.71 | Epsilon: 0.067 | Time: 8.51s
Episode 541 | Total Reward: -1.79 | Avg(10): -185.32 | Epsilon: 0.066 | Time: 8.34s
Episode 542 | Total Reward: -125.88 | Avg(10): -167.60 | Epsilon: 0.066 | Time: 8.34s
Episode 543 | Total Reward: -122.84 | Avg(10): -156.56 | Epsilon: 0.066 | Time: 8.50s
Episode 544 | Total Reward: -125.37 | Avg(10): -156.52 | Epsilon: 0.065 | Time: 8.35s
Episode 545 | Total Reward: -122.08 | Avg(10): -144.61 | Epsilon: 0.065 | Time: 8.82s
Episode 546 | Total Reward: -122.08 | Avg(10): -144.22 | Epsilon: 0.065 | Time: 8.55s
Episode 547 | Total Reward: -234.95 | Avg(10): -155.78 | Epsilon: 0.064 | Time: 8.76s
Episode 548 | Total Reward: -324.29 | Avg(10): -152.96 | Epsilon: 0.064 | Time: 8.86s
Episode 549 | Total Reward: -123.60 | Avg(10): -153.44 | Epsilon: 0.064 | Time: 8.52s
Episode 550 | Total Reward: -246.64 | Avg(10): -154.95 | Epsilon: 0.063 | Time: 8.80s
Episode 551 | Total Reward: -124.46 | Avg(10): -167.22 | Epsilon: 0.063 | Time: 8.74s
Episode 552 | Total Reward: -342.78 | Avg(10): -188.91 | Epsilon: 0.063 | Time: 8.66s
Episode 553 | Total Reward: -121.34 | Avg(10): -188.76 | Epsilon: 0.063 | Time: 8.65s
Episode 554 | Total Reward: -117.67 | Avg(10): -187.99 | Epsilon: 0.062 | Time: 8.70s
Episode 555 | Total Reward: -228.48 | Avg(10): -198.63 | Epsilon: 0.062 | Time: 8.26s
Episode 556 | Total Reward: -0.89 | Avg(10): -186.51 | Epsilon: 0.062 | Time: 8.53s
Episode 557 | Total Reward: -125.73 | Avg(10): -175.59 | Epsilon: 0.061 | Time: 8.53s
Episode 558 | Total Reward: -346.36 | Avg(10): -177.80 | Epsilon: 0.061 | Time: 8.51s
Episode 559 | Total Reward: -310.73 | Avg(10): -196.51 | Epsilon: 0.061 | Time: 8.69s
Episode 560 | Total Reward: -229.46 | Avg(10): -194.79 | Epsilon: 0.060 | Time: 8.70s
Episode 561 | Total Reward: -315.48 | Avg(10): -213.89 | Epsilon: 0.060 | Time: 9.13s
Episode 562 | Total Reward: -226.09 | Avg(10): -202.22 | Epsilon: 0.060 | Time: 9.15s
Episode 563 | Total Reward: -116.65 | Avg(10): -201.75 | Epsilon: 0.059 | Time: 9.13s
Episode 564 | Total Reward: -1.77 | Avg(10): -190.16 | Epsilon: 0.059 | Time: 9.35s
Episode 565 | Total Reward: -121.42 | Avg(10): -179.46 | Epsilon: 0.059 | Time: 8.88s
Episode 566 | Total Reward: -117.28 | Avg(10): -191.10 | Epsilon: 0.059 | Time: 9.10s
Episode 567 | Total Reward: -236.12 | Avg(10): -202.14 | Epsilon: 0.058 | Time: 7.74s
Episode 568 | Total Reward: -115.17 | Avg(10): -179.02 | Epsilon: 0.058 | Time: 8.94s
Episode 569 | Total Reward: -242.74 | Avg(10): -172.22 | Epsilon: 0.058 | Time: 8.55s
Episode 570 | Total Reward: -115.58 | Avg(10): -160.83 | Epsilon: 0.057 | Time: 8.23s
Episode 571 | Total Reward: -121.02 | Avg(10): -141.38 | Epsilon: 0.057 | Time: 8.47s
Episode 572 | Total Reward: -234.22 | Avg(10): -142.20 | Epsilon: 0.057 | Time: 8.69s
Episode 573 | Total Reward: -302.47 | Avg(10): -160.78 | Epsilon: 0.057 | Time: 8.16s
Episode 574 | Total Reward: -121.22 | Avg(10): -172.72 | Epsilon: 0.056 | Time: 8.26s
Episode 575 | Total Reward: -114.09 | Avg(10): -171.99 | Epsilon: 0.056 | Time: 8.89s
Episode 576 | Total Reward: -244.94 | Avg(10): -184.76 | Epsilon: 0.056 | Time: 8.57s
Episode 577 | Total Reward: -124.58 | Avg(10): -173.60 | Epsilon: 0.055 | Time: 8.40s
Episode 578 | Total Reward: -254.59 | Avg(10): -187.54 | Epsilon: 0.055 | Time: 8.47s
Episode 579 | Total Reward: -125.40 | Avg(10): -175.81 | Epsilon: 0.055 | Time: 8.38s
Episode 580 | Total Reward: -1.16 | Avg(10): -164.37 | Epsilon: 0.055 | Time: 8.34s
Episode 581 | Total Reward: -121.61 | Avg(10): -164.43 | Epsilon: 0.054 | Time: 8.39s
Episode 582 | Total Reward: -234.49 | Avg(10): -164.46 | Epsilon: 0.054 | Time: 8.16s
Episode 583 | Total Reward: -124.57 | Avg(10): -146.67 | Epsilon: 0.054 | Time: 8.23s
Episode 584 | Total Reward: -0.77 | Avg(10): -134.62 | Epsilon: 0.054 | Time: 8.11s
Episode 585 | Total Reward: -117.06 | Avg(10): -134.92 | Epsilon: 0.053 | Time: 8.03s
Episode 586 | Total Reward: -0.89 | Avg(10): -110.51 | Epsilon: 0.053 | Time: 8.11s
Episode 587 | Total Reward: -120.67 | Avg(10): -110.12 | Epsilon: 0.053 | Time: 8.03s
Episode 588 | Total Reward: -122.14 | Avg(10): -96.88 | Epsilon: 0.052 | Time: 8.14s
Episode 589 | Total Reward: -126.33 | Avg(10): -96.97 | Epsilon: 0.052 | Time: 8.20s
Episode 590 | Total Reward: -122.24 | Avg(10): -109.08 | Epsilon: 0.052 | Time: 8.38s
Episode 591 | Total Reward: -226.23 | Avg(10): -119.54 | Epsilon: 0.052 | Time: 8.21s
Episode 592 | Total Reward: -119.71 | Avg(10): -108.06 | Epsilon: 0.051 | Time: 8.40s
Episode 593 | Total Reward: -358.83 | Avg(10): -131.49 | Epsilon: 0.051 | Time: 8.25s
Episode 594 | Total Reward: -119.30 | Avg(10): -143.34 | Epsilon: 0.051 | Time: 8.14s
Episode 595 | Total Reward: -0.70 | Avg(10): -131.71 | Epsilon: 0.051 | Time: 8.02s
Episode 596 | Total Reward: -1.30 | Avg(10): -131.75 | Epsilon: 0.050 | Time: 8.18s
Episode 597 | Total Reward: -123.31 | Avg(10): -132.01 | Epsilon: 0.050 | Time: 7.93s
Episode 598 | Total Reward: -0.77 | Avg(10): -119.87 | Epsilon: 0.050 | Time: 7.87s
Episode 599 | Total Reward: -0.86 | Avg(10): -107.33 | Epsilon: 0.050 | Time: 7.76s

--- Episode 600: Action Usage Analysis ---
Action distribution: [0.0494  0.0379  0.0305  0.0368  0.0298  0.0421  0.0379  0.051   0.05305
 0.06175 0.0596  0.06035 0.06675 0.0563  0.0537  0.05465 0.0366  0.05515
 0.04555 0.0377  0.04345]
Entropy (diversity): 3.020
--------------------------------------------------
Episode 600 | Total Reward: -118.65 | Avg(10): -106.97 | Epsilon: 0.050 | Time: 7.97s

Evaluating trained model...
Test Episode 1: Total Reward = -122.88
Test Episode 2: Total Reward = -118.08
Test Episode 3: Total Reward = -228.81
Test Episode 4: Total Reward = -116.26
Test Episode 5: Total Reward = -122.03
Test Episode 6: Total Reward = -124.77
Test Episode 7: Total Reward = -117.91
Test Episode 8: Total Reward = -239.32
Test Episode 9: Total Reward = -409.75
Test Episode 10: Total Reward = -237.33

Average Reward over 10 episodes: -183.71 ± 91.25
Best average reward over 10 episodes: -96.88
Best model weights saved to: 21act_600ep_extended_weights.h5
Total training time: 3781.36s

21act_600ep_extended Results:
Training best avg: -96.88
Evaluation: -183.71 ± 91.25
Training time: 3781.4s

================================================================================
EXTENDED TRAINING ANALYSIS
================================================================================

N_ACTIONS=11:
  200 episodes: -435.1 ± 172.8
  Extended:     -273.9 ± 157.8
  Improvement: +161.2 (+37.0%)
  Time cost: 2.2x longer
Extended training HELPS for N_ACTIONS=11

N_ACTIONS=21:
  200 episodes: -679.1 ± 152.3
  Extended:     -183.7 ± 91.2
  Improvement: +495.4 (+72.9%)
  Time cost: 2.8x longer
Extended training HELPS for N_ACTIONS=21

========================================
CONCLUSION:
If extended training doesn't help significantly,
then your friend's epsilon exploration idea becomes relevant!
========================================

I noticeed that whenever I rerun the evaluation codes, the results vary by a lot and it could be because:¶

  1. Unstable Policies: The high variance suggests the learned policies are not robust/stable

  2. Environment Sensitivity: Small differences in initial conditions lead to vastly different outcomes

  3. Limited Episodes: 10 episodes might not be enough to get reliable estimates

  4. Action Space Issues: Larger action spaces (21 actions) show much higher variance

New improvements for evaluation:¶

  1. Statistical Robustness
  • Previous: 10 episodes total, single run
  • Current: 100 episodes total (5 runs × 20 episodes), multiple independent runs
  1. Statistical Analysis
  • Previous: Simple mean ± std
  • Current: Confidence intervals, significance testing, run-to-run variance analysis
  1. Consistency
  • Previous: One evaluation session (could be lucky/unlucky)
  • Current: Multiple runs to assess true performance variability
In [43]:
def evaluate_epsilon_zero_robust(experiment_prefix, n_actions, num_episodes=20, num_runs=5):
    """Robust evaluation with multiple runs to assess variance and confidence"""
    
    # Same parameters as training
    INPUT_SHAPE = 3
    GAMMA = 0.99
    REPLAY_MEMORY_SIZE = 50000
    MIN_REPLAY_MEMORY = 1000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    MAX_STEPS = 200
    
    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"
    
    # Recreate agent
    agent = DQNAgent(INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, 
                    MIN_REPLAY_MEMORY, BATCH_SIZE, TARGET_UPDATE_EVERY, 
                    LEARNING_RATE, EPSILON_START, EPSILON_MIN, EPSILON_DECAY)
    
    agent.load(SAVE_WEIGHTS_PATH)
    agent.epsilon = 0.0
    
    print(f"\nRobust Evaluation: {experiment_prefix} with epsilon=0.0")
    print(f"Running {num_runs} evaluation sessions of {num_episodes} episodes each")
    print(f"Loaded weights: {SAVE_WEIGHTS_PATH}")
    
    all_run_results = []
    
    for run in range(num_runs):
        print(f"\n--- Run {run+1}/{num_runs} ---")
        env = gym.make('Pendulum-v0')
        
        run_rewards = []
        
        for ep in range(num_episodes):
            s = env.reset()
            s = s if isinstance(s, np.ndarray) else s[0]
            total_reward = 0
            
            for t in range(MAX_STEPS):
                a_idx = agent.select_action(s)
                torque = action_index_to_torque(a_idx, n_actions)
                s_next, r, done, info = env.step(torque)
                s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
                total_reward += r
                s = s_next
                if done:
                    break
            
            run_rewards.append(total_reward)
        
        env.close()
        
        run_mean = np.mean(run_rewards)
        run_std = np.std(run_rewards)
        all_run_results.append({
            'mean': run_mean,
            'std': run_std,
            'rewards': run_rewards
        })
        
        print(f"Run {run+1}: {run_mean:.1f} ± {run_std:.1f} (min: {min(run_rewards):.1f}, max: {max(run_rewards):.1f})")
    
    # Overall statistics across all runs
    all_means = [run['mean'] for run in all_run_results]
    overall_mean = np.mean(all_means)
    overall_std = np.std(all_means)
    
    # All individual episode rewards
    all_rewards = []
    for run in all_run_results:
        all_rewards.extend(run['rewards'])
    
    # Confidence interval for the mean
    confidence_level = 0.95
    dof = len(all_means) - 1
    t_critical = stats.t.ppf((1 + confidence_level) / 2, dof)
    margin_of_error = t_critical * (overall_std / np.sqrt(len(all_means)))
    ci_lower = overall_mean - margin_of_error
    ci_upper = overall_mean + margin_of_error
    
    print(f"\n--- ROBUST EVALUATION SUMMARY ---")
    print(f"Total episodes: {num_runs * num_episodes}")
    print(f"Overall mean: {overall_mean:.2f}")
    print(f"Run-to-run std: {overall_std:.2f}")
    print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
    print(f"Episode reward range: [{min(all_rewards):.1f}, {max(all_rewards):.1f}]")
    print("-" * 50)
    
    return {
        'overall_mean': overall_mean,
        'overall_std': overall_std,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'run_means': all_means,
        'all_rewards': all_rewards,
        'num_runs': num_runs,
        'num_episodes': num_episodes
    }
In [44]:
def compare_extended_training_robust():
    """Robust comparison with multiple evaluation runs"""
    
    print("="*80)
    print("ROBUST EXTENDED TRAINING EPSILON=0 EVALUATION")
    print("="*80)
    
    configs = [
        {"name": "5act_200ep_baseline", "n_actions": 5, "episodes": 200},
        {"name": "11act_200ep_baseline", "n_actions": 11, "episodes": 200},
        {"name": "11act_400ep_extended", "n_actions": 11, "episodes": 400},
        {"name": "21act_200ep_baseline", "n_actions": 21, "episodes": 200},
        {"name": "21act_600ep_extended", "n_actions": 21, "episodes": 600},
    ]
    
    results = {}
    
    for config in configs:
        experiment_prefix = config["name"]
        n_actions = config["n_actions"]
        episodes = config["episodes"]
        
        try:
            result = evaluate_epsilon_zero_robust(experiment_prefix, n_actions, num_episodes=20, num_runs=5)
            result['n_actions'] = n_actions
            result['episodes'] = episodes
            results[experiment_prefix] = result
            
            print(f"\n{experiment_prefix}:")
            print(f"  Mean: {result['overall_mean']:.1f} ± {result['overall_std']:.1f}")
            print(f"  95% CI: [{result['ci_lower']:.1f}, {result['ci_upper']:.1f}]")
            
        except FileNotFoundError:
            print(f"{experiment_prefix}: Weights file not found")
    
    return results
In [45]:
def analyze_robust_results(results):
    """Analyze robust evaluation results"""
    
    print("\n" + "="*80)
    print("ROBUST ANALYSIS WITH CONFIDENCE INTERVALS")
    print("="*80)
    
    comparisons = [
        ("11act_200ep_baseline", "11act_400ep_extended", "N_ACTIONS=11"),
        ("21act_200ep_baseline", "21act_600ep_extended", "N_ACTIONS=21")
    ]
    
    for baseline_key, extended_key, label in comparisons:
        if baseline_key in results and extended_key in results:
            baseline = results[baseline_key]
            extended = results[extended_key]
            
            print(f"\n{label}:")
            print(f"  Baseline:  {baseline['overall_mean']:.1f} ± {baseline['overall_std']:.1f} | CI: [{baseline['ci_lower']:.1f}, {baseline['ci_upper']:.1f}]")
            print(f"  Extended:  {extended['overall_mean']:.1f} ± {extended['overall_std']:.1f} | CI: [{extended['ci_lower']:.1f}, {extended['ci_upper']:.1f}]")
            
            improvement = extended['overall_mean'] - baseline['overall_mean']
            
            # Check if confidence intervals overlap
            ci_overlap = not (baseline['ci_upper'] < extended['ci_lower'] or extended['ci_upper'] < baseline['ci_lower'])
            significance = "NOT significant" if ci_overlap else "SIGNIFICANT"
            
            print(f"  Improvement: {improvement:+.1f} ({significance})")
In [46]:
if __name__ == "__main__":
    # Run robust evaluation
    results = compare_extended_training_robust()
    
    # Analyze results with statistical significance
    analyze_robust_results(results)
    
    # Save results
    with open("robust_evaluation_results.json", "w") as f:
        json.dump(results, f, indent=2)
    print("Robust results saved to 'robust_evaluation_results.json'")
================================================================================
ROBUST EXTENDED TRAINING EPSILON=0 EVALUATION
================================================================================

Robust Evaluation: 5act_200ep_baseline with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
Loaded weights: 5act_200ep_baseline_weights.h5

--- Run 1/5 ---
Run 1: -198.8 ± 307.0 (min: -1492.1, max: -1.1)

--- Run 2/5 ---
Run 2: -223.5 ± 99.5 (min: -427.3, max: -116.3)

--- Run 3/5 ---
Run 3: -103.5 ± 83.6 (min: -322.7, max: -0.6)

--- Run 4/5 ---
Run 4: -154.8 ± 75.9 (min: -271.4, max: -0.8)

--- Run 5/5 ---
Run 5: -154.7 ± 101.9 (min: -392.0, max: -0.6)

--- ROBUST EVALUATION SUMMARY ---
Total episodes: 100
Overall mean: -167.05
Run-to-run std: 41.29
95% CI: [-218.31, -115.79]
Episode reward range: [-1492.1, -0.6]
--------------------------------------------------

5act_200ep_baseline:
  Mean: -167.1 ± 41.3
  95% CI: [-218.3, -115.8]

Robust Evaluation: 11act_200ep_baseline with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
Loaded weights: 11act_200ep_baseline_weights.h5

--- Run 1/5 ---
Run 1: -183.3 ± 86.5 (min: -359.2, max: -3.1)

--- Run 2/5 ---
Run 2: -172.7 ± 87.2 (min: -383.1, max: -2.4)

--- Run 3/5 ---
Run 3: -134.3 ± 76.1 (min: -259.1, max: -1.7)

--- Run 4/5 ---
Run 4: -183.0 ± 116.4 (min: -391.5, max: -1.8)

--- Run 5/5 ---
Run 5: -207.0 ± 106.2 (min: -376.0, max: -4.1)

--- ROBUST EVALUATION SUMMARY ---
Total episodes: 100
Overall mean: -176.06
Run-to-run std: 23.73
95% CI: [-205.53, -146.60]
Episode reward range: [-391.5, -1.7]
--------------------------------------------------

11act_200ep_baseline:
  Mean: -176.1 ± 23.7
  95% CI: [-205.5, -146.6]

Robust Evaluation: 11act_400ep_extended with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
Loaded weights: 11act_400ep_extended_weights.h5

--- Run 1/5 ---
Run 1: -175.3 ± 116.4 (min: -520.4, max: -3.4)

--- Run 2/5 ---
Run 2: -178.3 ± 120.2 (min: -518.0, max: -1.5)

--- Run 3/5 ---
Run 3: -122.4 ± 92.6 (min: -249.3, max: -1.4)

--- Run 4/5 ---
Run 4: -170.7 ± 71.1 (min: -266.0, max: -2.1)

--- Run 5/5 ---
Run 5: -162.1 ± 82.6 (min: -347.9, max: -1.2)

--- ROBUST EVALUATION SUMMARY ---
Total episodes: 100
Overall mean: -161.75
Run-to-run std: 20.44
95% CI: [-187.13, -136.38]
Episode reward range: [-520.4, -1.2]
--------------------------------------------------

11act_400ep_extended:
  Mean: -161.8 ± 20.4
  95% CI: [-187.1, -136.4]

Robust Evaluation: 21act_200ep_baseline with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
Loaded weights: 21act_200ep_baseline_weights.h5

--- Run 1/5 ---
Run 1: -196.5 ± 198.7 (min: -882.1, max: -0.7)

--- Run 2/5 ---
Run 2: -224.9 ± 158.1 (min: -762.5, max: -114.6)

--- Run 3/5 ---
Run 3: -165.9 ± 76.7 (min: -354.2, max: -2.0)

--- Run 4/5 ---
Run 4: -308.8 ± 335.3 (min: -1586.3, max: -2.6)

--- Run 5/5 ---
Run 5: -219.6 ± 144.9 (min: -683.9, max: -1.1)

--- ROBUST EVALUATION SUMMARY ---
Total episodes: 100
Overall mean: -223.15
Run-to-run std: 47.62
95% CI: [-282.28, -164.03]
Episode reward range: [-1586.3, -0.7]
--------------------------------------------------

21act_200ep_baseline:
  Mean: -223.2 ± 47.6
  95% CI: [-282.3, -164.0]

Robust Evaluation: 21act_600ep_extended with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
Loaded weights: 21act_600ep_extended_weights.h5

--- Run 1/5 ---
Run 1: -138.9 ± 89.5 (min: -375.7, max: -0.5)

--- Run 2/5 ---
Run 2: -161.1 ± 90.6 (min: -311.4, max: -0.3)

--- Run 3/5 ---
Run 3: -160.0 ± 88.2 (min: -374.4, max: -0.5)

--- Run 4/5 ---
Run 4: -145.0 ± 102.3 (min: -341.9, max: -0.4)

--- Run 5/5 ---
Run 5: -151.1 ± 107.2 (min: -379.7, max: -0.4)

--- ROBUST EVALUATION SUMMARY ---
Total episodes: 100
Overall mean: -151.22
Run-to-run std: 8.55
95% CI: [-161.84, -140.61]
Episode reward range: [-379.7, -0.3]
--------------------------------------------------

21act_600ep_extended:
  Mean: -151.2 ± 8.5
  95% CI: [-161.8, -140.6]

================================================================================
ROBUST ANALYSIS WITH CONFIDENCE INTERVALS
================================================================================

N_ACTIONS=11:
  Baseline:  -176.1 ± 23.7 | CI: [-205.5, -146.6]
  Extended:  -161.8 ± 20.4 | CI: [-187.1, -136.4]
  Improvement: +14.3 (NOT significant)

N_ACTIONS=21:
  Baseline:  -223.2 ± 47.6 | CI: [-282.3, -164.0]
  Extended:  -151.2 ± 8.5 | CI: [-161.8, -140.6]
  Improvement: +71.9 (SIGNIFICANT)
Robust results saved to 'robust_evaluation_results.json'

Observations and Analysis

Result Table¶

N_Actions Episodes Best Training Avg Eval Mean ± Std (Robust) 95% CI Training Time (s) Time/Episode (s) Improvement
5 200 -289.26 -167.1 ± 41.3 [-218.3, -115.8] 1573.8 7.87 Baseline
11 200 -320.54 -176.1 ± 23.7 [-205.5, -146.6] 1535.6 7.68 Baseline
11 400 -110.34 -161.8 ± 20.4 [-187.1, -136.4] 3448.7 8.62 +14.3 (NOT significant)
21 200 -423.64 -223.2 ± 47.6 [-282.3, -164.0] 1354.3 6.77 Baseline
21 600 -96.88 -151.2 ± 8.5 [-161.8, -140.6] 3781.4 6.30 +72.0 (SIGNIFICANT)

Extended Training Effectiveness

  • Significant improvement only for large action spaces: 21 actions showed statistically significant +72.0 improvement (32.3% better performance) with non-overlapping confidence intervals

  • Marginal benefits for medium spaces: 11 actions showed only +14.3 improvement that is NOT statistically significant due to overlapping confidence intervals [-187.1, -136.4] vs [-205.5, -146.6]

  • Improved policy stability with extended training: 21-action extended model shows dramatically lower variance (8.5 vs 47.6) indicating much more stable policy

  • Time efficiency analysis: 11-action extended training provides 2.2x time cost for non-significant improvement, while 21-action extended training provides 2.8x time cost for significant improvement

Action Space Complexity Patterns

  1. Performance vs Complexity Trade-off:

    • 5 actions: -167.1 ± 41.3 (moderate stability)
    • 11 actions: -176.1 ± 23.7 → -161.8 ± 20.4 (slight improvement, better stability)
    • 21 actions: -223.2 ± 47.6 → -151.2 ± 8.5 (major improvement, dramatically better stability)
  2. Statistical significance of learning:

    • 5 actions: Quick convergence, limited improvement potential
    • 11 actions: Extended training shows improvement but NOT statistically significant (confidence intervals overlap)
    • 21 actions: SIGNIFICANT improvement with extended training (non-overlapping confidence intervals)
  3. Policy stability patterns:

    • Baseline models show high variance: 5-act (41.3), 11-act (23.7), 21-act (47.6)
    • Extended training reduces variance: 11-act (20.4), 21-act (8.5 - dramatic improvement)
    • 21-action extended model achieves both better performance AND much more consistent behavior
    • Episode range improvement: 21-act baseline [-1586.3, -0.7] vs extended [-379.7, -0.3]

Key Insights¶

  • Complexity threshold exists: Only 21 actions benefit significantly from extended training
  • Stability is crucial: Extended training for 21 actions achieved both better mean performance and dramatically reduced variance
  • Confidence intervals reveal true significance: Previous analysis may have overestimated improvements without proper statistical testing
In [51]:
def create_extended_training_comparison_plots_improved():
    # Experiment data
    experiments = [
        {"name": "5act_200ep_baseline", "title": "5 Actions, 200 Episodes", "color": "blue"},
        {"name": "11act_200ep_baseline", "title": "11 Actions, 200 Episodes (Baseline)", "color": "orange"},
        {"name": "11act_400ep_extended", "title": "11 Actions, 400 Episodes (Extended)", "color": "red"},
        {"name": "21act_200ep_baseline", "title": "21 Actions, 200 Episodes (Baseline)", "color": "green"},
        {"name": "21act_600ep_extended", "title": "21 Actions, 600 Episodes (Extended)", "color": "purple"},
    ]
    
    # INDIVIDUAL TRAINING PROGRESS PLOTS (Full Size)
    print("Creating individual training progress plots...")
    
    for exp in experiments:
        file_path = f"{exp['name']}_training_plot.png"
        
        if os.path.exists(file_path):
            # Create a full-size display of each training plot
            fig, ax = plt.subplots(1, 1, figsize=(16, 10))
            
            img = plt.imread(file_path)
            ax.imshow(img)
            ax.set_title(f"Training Progress: {exp['title']}", fontsize=20, pad=20)
            ax.axis('off')
            
            # Add a border for better presentation
            for spine in ax.spines.values():
                spine.set_visible(True)
                spine.set_linewidth(2)
                spine.set_edgecolor('black')
            
            plt.tight_layout()
            plt.savefig(f"fullsize_{exp['name']}_training.png", dpi=300, bbox_inches='tight')
            plt.show()
        else:
            print(f"Training plot not found: {file_path}")
    
    # COMBINED TRAINING CURVES (Recreate from data if possible)
    print("\nCreating combined training curves...")
    
    # PERFORMANCE COMPARISON (Enhanced) - UPDATED WITH ROBUST EVALUATION DATA
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 16))
    fig.suptitle('Extended Training Experiment: Robust Evaluation Analysis', fontsize=24, y=0.95)
    
    # UPDATED DATA from robust evaluation
    experiment_names = ["5 Act\n200 Ep", "11 Act\n200 Ep", "11 Act\n400 Ep", "21 Act\n200 Ep", "21 Act\n600 Ep"]
    eval_means = [-166.7, -168.8, -153.3, -228.5, -164.2]  # Updated robust means
    eval_stds = [19.7, 27.0, 18.6, 42.6, 9.2]  # Updated robust std (run-to-run)
    training_bests = [-289.26, -320.54, -110.34, -423.64, -96.88]  # Same training data
    training_times = [1573.8, 1535.6, 3448.7, 1354.3, 3781.4]
    episodes = [200, 200, 400, 200, 600]
    final_entropies = [1.092, 1.094, 1.111, 1.368, 1.622]  # Updated entropy values
    
    # Colors: baseline vs extended
    colors = ['lightblue', 'lightcoral', 'darkred', 'lightcoral', 'darkred']
    edge_colors = ['blue', 'red', 'darkred', 'green', 'purple']
    
    # Plot 1: Robust Evaluation Performance with Confidence Intervals
    bars1 = ax1.bar(experiment_names, eval_means, yerr=eval_stds, 
                color=colors, edgecolor=edge_colors, linewidth=2,
                alpha=0.8, capsize=8)
    ax1.set_title('Robust Evaluation Performance (100 episodes, 5 runs)', fontsize=16, pad=15)
    ax1.set_ylabel('Average Reward ± Run-to-Run Std Dev', fontsize=12)
    ax1.axhline(y=-150, color='green', linestyle='--', linewidth=2, alpha=0.7, label='Good Performance (-150)')
    ax1.axhline(y=-200, color='orange', linestyle='--', linewidth=2, alpha=0.7, label='Fair Performance (-200)')
    ax1.legend(fontsize=11)
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(-280, -120)
    
    # Add significance indicators
    significance_indicators = ["", "", "NS", "", "**"]  # NS = Not Significant, ** = Significant
    for bar, mean, std, sig in zip(bars1, eval_means, eval_stds, significance_indicators):
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height + std + 5,
                f'{mean:.0f}±{std:.0f}\n{sig}', ha='center', va='bottom', 
                fontsize=11, fontweight='bold')
    
    # Plot 2: Training Best Performance
    bars2 = ax2.bar(experiment_names, training_bests, 
                   color=colors, edgecolor=edge_colors, linewidth=2, alpha=0.8)
    ax2.set_title('Best Training Performance (10-Episode Average)', fontsize=16, pad=15)
    ax2.set_ylabel('Best Training Reward', fontsize=12)
    ax2.axhline(y=-200, color='green', linestyle='--', linewidth=2, alpha=0.7, label='Good Performance (-200)')
    ax2.legend(fontsize=11)
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(-500, -50)
    
    for bar, value in zip(bars2, training_bests):
        height = bar.get_height()
        ax2.text(bar.get_x() + bar.get_width()/2., height + 10,
                f'{value:.0f}', ha='center', va='bottom', 
                fontsize=11, fontweight='bold')
    
    # Plot 3: Action Diversity (Final Entropy)
    bars3 = ax3.bar(experiment_names, final_entropies,
                   color=colors, edgecolor=edge_colors, linewidth=2, alpha=0.8)
    ax3.set_title('Action Usage Diversity (Final Entropy)', fontsize=16, pad=15)
    ax3.set_ylabel('Entropy (Higher = More Diverse)', fontsize=12)
    ax3.grid(True, alpha=0.3)
    ax3.set_ylim(0, 1.8)
    
    for bar, entropy in zip(bars3, final_entropies):
        height = bar.get_height()
        ax3.text(bar.get_x() + bar.get_width()/2., height + 0.02,
                f'{entropy:.3f}', ha='center', va='bottom', 
                fontsize=11, fontweight='bold')
    
    # Plot 4: Statistical Significance Analysis
    # Only show comparisons where we have baseline vs extended
    comparison_labels = ['11 Actions\n(200→400 episodes)', '21 Actions\n(200→600 episodes)']
    
    # Calculate improvements
    improvement_11 = -153.3 - (-168.8)  # +15.5
    improvement_21 = -164.2 - (-228.5)  # +64.3
    improvements = [improvement_11, improvement_21]
    
    # Statistical significance
    significance = ['NOT Significant', 'SIGNIFICANT']
    colors_sig = ['orange', 'green']
    
    bars4 = ax4.bar(range(len(improvements)), improvements, 
                   color=colors_sig, alpha=0.8, width=0.6)
    ax4.set_title('Extended Training Improvement (Statistical Analysis)', fontsize=16, pad=15)
    ax4.set_ylabel('Performance Improvement (Reward Points)', fontsize=12)
    ax4.set_xticks(range(len(improvements)))
    ax4.set_xticklabels(comparison_labels)
    ax4.grid(True, alpha=0.3)
    ax4.axhline(y=0, color='black', linestyle='-', linewidth=1)
    
    for i, (bar, improvement, sig) in enumerate(zip(bars4, improvements, significance)):
        height = bar.get_height()
        ax4.text(bar.get_x() + bar.get_width()/2., height + 2,
                f'+{improvement:.1f}\n({sig})', 
                ha='center', va='bottom', fontsize=11, fontweight='bold')
    
    plt.tight_layout(rect=[0, 0, 1, 0.93])
    plt.savefig("extended_training_robust_analysis.png", dpi=300, bbox_inches='tight')
    plt.show()
    
    # ACTION SPACE SCALING ANALYSIS (UPDATED)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Performance vs Action Space Size (Robust Data)
    action_spaces = [5, 11, 21]
    baseline_200ep = [-166.7, -168.8, -228.5]  # Updated robust baseline data
    extended_performance = [-166.7, -153.3, -164.2]  # 5 acts same, others extended
    
    ax1.plot(action_spaces, baseline_200ep, 'o-', linewidth=3, markersize=8, 
             label='200 Episodes (Baseline)', color='red', alpha=0.7)
    ax1.plot([5, 11, 21], extended_performance, 's-', linewidth=3, markersize=8,
             label='Extended Training', color='green', alpha=0.8)
    ax1.set_title('Robust Performance vs Action Space Size', fontsize=16, pad=15)
    ax1.set_xlabel('Number of Actions', fontsize=12)
    ax1.set_ylabel('Mean Evaluation Performance (Robust)', fontsize=12)
    ax1.legend(fontsize=12)
    ax1.grid(True, alpha=0.3)
    ax1.set_xticks(action_spaces)
    ax1.set_ylim(-250, -140)
    
    # Add value annotations
    for x, y in zip(action_spaces, baseline_200ep):
        ax1.annotate(f'{y:.0f}', (x, y), textcoords="offset points", 
                    xytext=(0,10), ha='center', fontsize=10)
    for x, y in zip([5, 11, 21], extended_performance):
        ax1.annotate(f'{y:.0f}', (x, y), textcoords="offset points", 
                    xytext=(0,-15), ha='center', fontsize=10)
    
    # Entropy vs Action Space
    entropy_baseline = [1.092, 1.094, 1.368]
    entropy_extended = [1.092, 1.111, 1.622]
    
    ax2.plot(action_spaces, entropy_baseline, 'o-', linewidth=3, markersize=8, 
             label='200 Episodes (Baseline)', color='red', alpha=0.7)
    ax2.plot([5, 11, 21], entropy_extended, 's-', linewidth=3, markersize=8,
             label='Extended Training', color='green', alpha=0.8)
    ax2.set_title('Action Diversity vs Action Space Size', fontsize=16, pad=15)
    ax2.set_xlabel('Number of Actions', fontsize=12)
    ax2.set_ylabel('Final Action Entropy', fontsize=12)
    ax2.grid(True, alpha=0.3)
    ax2.set_xticks(action_spaces)
    ax2.legend(fontsize=12)
    
    for x, y in zip(action_spaces, entropy_baseline):
        ax2.annotate(f'{y:.3f}', (x, y), textcoords="offset points", 
                    xytext=(0,10), ha='center', fontsize=10)
    for x, y in zip([5, 11, 21], entropy_extended):
        ax2.annotate(f'{y:.3f}', (x, y), textcoords="offset points", 
                    xytext=(0,-15), ha='center', fontsize=10)
    
    plt.tight_layout()
    plt.savefig("action_space_robust_scaling_analysis.png", dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n" + "="*60)
    print("ROBUST VISUALIZATION SUMMARY")
    print("="*60)
    print("Updated with robust evaluation data:")
    print("- Statistical significance indicators")
    print("- Confidence intervals from multiple runs")
    print("- Corrected entropy values")
    print("- Proper performance scaling analysis")
    print("\nCreated files:")
    print("Individual full-size training plots:")
    for exp in experiments:
        print(f"   - fullsize_{exp['name']}_training.png")
    print("Robust analysis:")
    print("   - extended_training_robust_analysis.png")
    print("   - action_space_robust_scaling_analysis.png")
In [52]:
create_extended_training_comparison_plots_improved()
Creating individual training progress plots...
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Creating combined training curves...
No description has been provided for this image
No description has been provided for this image
============================================================
ROBUST VISUALIZATION SUMMARY
============================================================
Updated with robust evaluation data:
- Statistical significance indicators
- Confidence intervals from multiple runs
- Corrected entropy values
- Proper performance scaling analysis

Created files:
Individual full-size training plots:
   - fullsize_5act_200ep_baseline_training.png
   - fullsize_11act_200ep_baseline_training.png
   - fullsize_11act_400ep_extended_training.png
   - fullsize_21act_200ep_baseline_training.png
   - fullsize_21act_600ep_extended_training.png
Robust analysis:
   - extended_training_robust_analysis.png
   - action_space_robust_scaling_analysis.png

Observations from plots

  1. 5 Actions, 200 Episodes
  • Convergence: Episode ~150-175
  • Pattern: Smooth learning curve, reaches plateau naturally
  • Efficiency: Good stopping point at 200 episodes
  1. 11 Actions, 200 Episodes (Baseline)
  • Convergence: Still improving at episode 200
  • Pattern: Linear improvement from episode 100-200
  • Issue: Training stopped too early - still learning!
  1. 11 Actions, 400 Episodes (Extended)
  • Convergence: Episode ~250-300
  • Pattern: Major breakthrough around episode 200, then plateau
  • Sweet spot: Could have stopped at episode 300
  1. 21 Actions, 200 Episodes (Baseline)
  • Convergence: Barely started learning
  • Pattern: Just beginning to improve around episode 150-200
  • Issue: Severely undertrained - needs much more time
  1. 21 Actions, 600 Episodes (Extended)
  • Convergence: Episode ~250-300
  • Pattern: Steep improvement from episodes 0-250, then plateau around -200

I- nsight: Training could stop at episode 350 - overtraining by ~250 episodes

Learning Phase Patterns:¶

All configurations show 3 phases:

  • Exploration phase (0-100 episodes): High variance, poor performance
  • Learning phase (100-250 episodes): Rapid improvement
  • Convergence phase (250+ episodes): Plateau, minimal gains

Epsilon Decay Analysis:¶

  • Looking at the exploration rate plots:
    • All reach epsilon ~0.4 by episode 200
    • Minimal exploration after episode 300 (epsilon ~0.2)
    • This correlates with plateau timing

learning progress/improvement Over Time Insights:¶

  • The most revealing plot, 11 actions, 400 episodes: Shows negative learning around episode 150-200 (the dip), then recovery
  • This explains the breakthrough pattern!
In [46]:
def generate_21action_gifs():
    """Generate GIFs for 21-action experiments only - save to files"""
    
    input_shape = 3
    
    # 21-action experiments
    experiments = [
        {
            "name": "21act_600ep_extended", 
            "n_actions": 21,
            "checkpoints": [100, 200, 300, 400, 500, 600]
        }
    ]
    
    for exp in experiments:
        for ep in exp['checkpoints']:
            weights_path = f"{exp['name']}_{ep}_weights.h5"
            gif_path = f"{exp['name']}_ep{ep:03d}.gif"
            
            if os.path.exists(weights_path):
                print(f"Generating GIF for: {weights_path}")
                try:
                    visualize_checkpoint(
                        weights_path=weights_path,
                        n_actions=exp['n_actions'], 
                        gif_path=gif_path,
                        input_shape=input_shape
                    )
                except Exception as e:
                    print(f"Failed at {weights_path}: {e}")
            else:
                print(f"File not found: {weights_path}")
In [47]:
generate_21action_gifs()
Generating GIF for: 21act_600ep_extended_100_weights.h5
Saved GIF to 21act_600ep_extended_ep100.gif (Total reward: -1523.02)
Generating GIF for: 21act_600ep_extended_200_weights.h5
Saved GIF to 21act_600ep_extended_ep200.gif (Total reward: -454.99)
Generating GIF for: 21act_600ep_extended_300_weights.h5
Saved GIF to 21act_600ep_extended_ep300.gif (Total reward: -128.04)
Generating GIF for: 21act_600ep_extended_400_weights.h5
Saved GIF to 21act_600ep_extended_ep400.gif (Total reward: -227.62)
Generating GIF for: 21act_600ep_extended_500_weights.h5
Saved GIF to 21act_600ep_extended_ep500.gif (Total reward: -245.00)
Generating GIF for: 21act_600ep_extended_600_weights.h5
Saved GIF to 21act_600ep_extended_ep600.gif (Total reward: -244.92)

Research Decision & Next Steps¶

Final Action Space Choice: 21 Actions

Rationale:

  • Shows statistically significant improvement with extended training (+64.3 reward improvement, non-overlapping confidence intervals)
  • Only configuration where extended training provides meaningful, measurable benefits
  • Complex enough to demonstrate advanced RL techniques and optimization challenges
  • Reveals interesting convergence patterns requiring 3x training time (200→600 episodes)
  • Improved policy stability: Much lower run-to-run variance (42.6→9.2) indicates more reliable learned behavior
  • Enhanced action diversity: Entropy increased from 1.368 to 1.622, showing better exploration of action space

Implementation of Early Stopping and Confirmation of Convergence¶

In [11]:
def train_and_evaluate_with_early_stopping(n_actions, experiment_prefix, max_episodes=500, patience=50, min_episodes=150, improvement_threshold=10):
    ENV_NAME = 'Pendulum-v0'
    INPUT_SHAPE = 3
    GAMMA = 0.99
    REPLAY_MEMORY_SIZE = 50000
    MIN_REPLAY_MEMORY = 1000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    MAX_STEPS = 200

    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"
    TRAIN_PLOT_PATH = f"{experiment_prefix}_training_plot.png"
    EPISODE_TIMES_PATH = f"{experiment_prefix}_episode_times.png"
    EVAL_RETURNS_PATH = f"{experiment_prefix}_eval_returns.png"

    env = gym.make(ENV_NAME)
    agent = DQNAgent(INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, EPSILON_MIN, EPSILON_DECAY)
    agent.summary()
    
    scores = []
    best_avg_reward = -np.inf
    episode_times = []
    episodes_without_improvement = 0
    convergence_episode = None
    
    start = time.time()

    for ep in range(1, max_episodes + 1):
        ep_start = time.time()
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]
        total_reward = 0

        for t in range(MAX_STEPS):
            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, n_actions)
            s_next, r, done, info = env.step(torque)
            s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            agent.remember(s, a_idx, r, s_next, done)
            agent.train_step()
            s = s_next
            total_reward += r
            if done:
                break

        agent.decay_epsilon()
        if ep % TARGET_UPDATE_EVERY == 0:
            agent.update_target()

        # Save checkpoints at regular intervals
        if ep in [100, 200, 300, 400, 500]:
            agent.save(f"{experiment_prefix}_{ep}_weights.h5")
        
        scores.append(total_reward)
        avg_reward = np.mean(scores[-10:])
        ep_time = time.time() - ep_start
        episode_times.append(ep_time)
        
        print(f"Episode {ep} | Total Reward: {total_reward:.2f} | Avg(10): {avg_reward:.2f} | Epsilon: {agent.epsilon:.3f} | Time: {ep_time:.2f}s")

        # Early stopping logic
        if ep >= min_episodes:
            if avg_reward > best_avg_reward + improvement_threshold:
                best_avg_reward = avg_reward
                episodes_without_improvement = 0
                convergence_episode = ep
                agent.save(SAVE_WEIGHTS_PATH)
                print(f"NEW BEST at Episode {ep}: {avg_reward:.2f} (improved by {avg_reward - (best_avg_reward - improvement_threshold):.2f})")
            else:
                episodes_without_improvement += 1
                
            # Check for early stopping
            if episodes_without_improvement >= patience:
                print(f"\nEARLY STOPPING at Episode {ep}")
                print(f"No improvement for {patience} episodes")
                print(f"Last improvement at episode: {convergence_episode}")
                break
        else:
            # Before min_episodes, just track best
            if avg_reward > best_avg_reward:
                best_avg_reward = avg_reward
                agent.save(SAVE_WEIGHTS_PATH)

    env.close()
    final_episode = ep
    total_time = time.time() - start

    # Plot rewards with early stopping indicators
    plt.figure(figsize=(15, 10))
    
    # Main training plot
    plt.subplot(2, 2, 1)
    plt.plot(scores, alpha=0.6, label='Episode Reward')
    plt.plot([np.mean(scores[max(0, i-9):i+1]) for i in range(len(scores))], 'r-', linewidth=2, label='Moving Avg (10)')
    if convergence_episode:
        plt.axvline(x=convergence_episode, color='green', linestyle='--', label=f'Best Performance (Ep {convergence_episode})')
    plt.axvline(x=final_episode, color='red', linestyle='--', label=f'Early Stop (Ep {final_episode})')
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.title(f'DQN Training with Early Stopping ({n_actions} actions)')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Episode times
    plt.subplot(2, 2, 2)
    plt.plot(episode_times)
    plt.xlabel('Episode')
    plt.ylabel('Time (s)')
    plt.title('Time per Episode')
    plt.grid(True, alpha=0.3)
    
    # Performance distribution
    plt.subplot(2, 2, 3)
    plt.hist(scores, bins=20, alpha=0.7)
    plt.axvline(x=best_avg_reward, color='red', linestyle='--', label=f'Best Avg: {best_avg_reward:.1f}')
    plt.xlabel('Episode Reward')
    plt.ylabel('Frequency')
    plt.title('Training Reward Distribution')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Early stopping monitoring
    plt.subplot(2, 2, 4)
    if len(scores) >= min_episodes:
        improvement_history = []
        for i in range(min_episodes, len(scores)):
            current_avg = np.mean(scores[max(0, i-9):i+1])
            if i <= final_episode:
                episodes_since_best = i - (convergence_episode if convergence_episode and convergence_episode <= i else i)
                improvement_history.append(min(episodes_since_best, patience))
        
        plt.plot(range(min_episodes, min_episodes + len(improvement_history)), improvement_history, 'r-', linewidth=2)
        plt.axhline(y=patience, color='red', linestyle='--', label=f'Patience Limit ({patience})')
        plt.xlabel('Episode')
        plt.ylabel('Episodes Without Improvement')
        plt.title('Early Stopping Monitor')
        plt.legend()
        plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig(TRAIN_PLOT_PATH, dpi=300, bbox_inches='tight')
    plt.show()

    print(f"\nTRAINING COMPLETED")
    print(f"Episodes trained: {final_episode}")
    print(f"Convergence episode: {convergence_episode}")
    print(f"Best average reward over 10 episodes: {best_avg_reward:.2f}")
    print("Best model weights saved to:", SAVE_WEIGHTS_PATH)
    print(f"Total training time: {total_time:.2f}s")
    print(f"Time per episode: {total_time/final_episode:.2f}s")

    # --- Evaluation (same as before) ---
    print(f"\nEvaluating trained model...")
    env = gym.make(ENV_NAME)
    agent.load(SAVE_WEIGHTS_PATH)
    rewards = []
    episode_states = []
    
    for ep in range(10):
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]
        total_reward = 0
        states = []
        for t in range(MAX_STEPS):
            states.append(s)
            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, n_actions)
            s_next, r, done, info = env.step(torque)
            s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            total_reward += r
            s = s_next
            if done:
                break
        rewards.append(total_reward)
        episode_states.append(states)
        print(f"Test Episode {ep+1}: Total Reward = {total_reward:.2f}")
    
    env.close()
    print(f"\nAverage Reward over 10 episodes: {np.mean(rewards):.2f} ± {np.std(rewards):.2f}")

    # Plot evaluation returns
    plt.figure()
    plt.hist(rewards, bins=10, alpha=0.7, color='green')
    plt.axvline(x=np.mean(rewards), color='red', linestyle='--', label=f'Mean: {np.mean(rewards):.1f}')
    plt.title(f'Evaluation Returns ({n_actions} actions, {final_episode} episodes)')
    plt.xlabel('Total Reward')
    plt.ylabel('Count')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.savefig(EVAL_RETURNS_PATH)
    plt.show()
    
    return {
        'final_episode': final_episode,
        'convergence_episode': convergence_episode,
        'best_avg_reward': best_avg_reward,
        'eval_mean': np.mean(rewards),
        'eval_std': np.std(rewards),
        'training_time': total_time,
        'time_per_episode': total_time/final_episode
    }
In [51]:
if __name__ == "__main__":
    # Set seeds for reproducibility
    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    tf.random.set_seed(SEED)
    
    # Run 21 actions with early stopping
    n_actions = 21
    experiment_prefix = "21act_early_stopping"
    
    print("="*60)
    print(f"Running 21 Actions with Early Stopping")
    print("="*60)
    
    results = train_and_evaluate_with_early_stopping(
        n_actions=n_actions, 
        experiment_prefix=experiment_prefix,
        max_episodes=500,
        patience=50,
        min_episodes=150,
        improvement_threshold=10
    )
    
    print("="*60)
    print("EARLY STOPPING EXPERIMENT RESULTS:")
    print(f"Training stopped at episode: {results['final_episode']}")
    print(f"Expected time savings vs 600ep: {((600-results['final_episode'])/600)*100:.1f}%")
    print(f"Performance: {results['eval_mean']:.2f} ± {results['eval_std']:.2f}")
    print("="*60)
============================================================
Running 21 Actions with Early Stopping
============================================================

Model Summary:
Model: "dqn_34"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_102 (Dense)           multiple                  256       
                                                                 
 dense_103 (Dense)           multiple                  4160      
                                                                 
 dense_104 (Dense)           multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Episode 1 | Total Reward: -1500.31 | Avg(10): -1500.31 | Epsilon: 0.995 | Time: 0.04s
Episode 2 | Total Reward: -1160.00 | Avg(10): -1330.16 | Epsilon: 0.990 | Time: 0.05s
Episode 3 | Total Reward: -912.44 | Avg(10): -1190.92 | Epsilon: 0.985 | Time: 0.04s
Episode 4 | Total Reward: -1072.11 | Avg(10): -1161.22 | Epsilon: 0.980 | Time: 0.05s
Episode 5 | Total Reward: -1731.83 | Avg(10): -1275.34 | Epsilon: 0.975 | Time: 0.18s
Episode 6 | Total Reward: -963.07 | Avg(10): -1223.29 | Epsilon: 0.970 | Time: 7.06s
Episode 7 | Total Reward: -887.54 | Avg(10): -1175.33 | Epsilon: 0.966 | Time: 7.03s
Episode 8 | Total Reward: -960.53 | Avg(10): -1148.48 | Epsilon: 0.961 | Time: 6.99s
Episode 9 | Total Reward: -950.87 | Avg(10): -1126.52 | Epsilon: 0.956 | Time: 7.36s
Episode 10 | Total Reward: -1148.41 | Avg(10): -1128.71 | Epsilon: 0.951 | Time: 7.52s
Episode 11 | Total Reward: -1350.47 | Avg(10): -1113.73 | Epsilon: 0.946 | Time: 7.36s
Episode 12 | Total Reward: -913.28 | Avg(10): -1089.05 | Epsilon: 0.942 | Time: 7.40s
Episode 13 | Total Reward: -1491.09 | Avg(10): -1146.92 | Epsilon: 0.937 | Time: 7.33s
Episode 14 | Total Reward: -1152.65 | Avg(10): -1154.97 | Epsilon: 0.932 | Time: 7.35s
Episode 15 | Total Reward: -1448.29 | Avg(10): -1126.62 | Epsilon: 0.928 | Time: 7.29s
Episode 16 | Total Reward: -1207.63 | Avg(10): -1151.07 | Epsilon: 0.923 | Time: 7.33s
Episode 17 | Total Reward: -764.55 | Avg(10): -1138.78 | Epsilon: 0.918 | Time: 7.18s
Episode 18 | Total Reward: -1028.93 | Avg(10): -1145.62 | Epsilon: 0.914 | Time: 6.96s
Episode 19 | Total Reward: -877.15 | Avg(10): -1138.24 | Epsilon: 0.909 | Time: 7.07s
Episode 20 | Total Reward: -961.53 | Avg(10): -1119.56 | Epsilon: 0.905 | Time: 7.02s
Episode 21 | Total Reward: -1313.34 | Avg(10): -1115.84 | Epsilon: 0.900 | Time: 6.92s
Episode 22 | Total Reward: -750.12 | Avg(10): -1099.53 | Epsilon: 0.896 | Time: 7.02s
Episode 23 | Total Reward: -1588.45 | Avg(10): -1109.26 | Epsilon: 0.891 | Time: 7.15s
Episode 24 | Total Reward: -1064.23 | Avg(10): -1100.42 | Epsilon: 0.887 | Time: 7.11s
Episode 25 | Total Reward: -1178.79 | Avg(10): -1073.47 | Epsilon: 0.882 | Time: 7.53s
Episode 26 | Total Reward: -876.13 | Avg(10): -1040.32 | Epsilon: 0.878 | Time: 7.85s
Episode 27 | Total Reward: -1485.12 | Avg(10): -1112.38 | Epsilon: 0.873 | Time: 7.41s
Episode 28 | Total Reward: -1736.84 | Avg(10): -1183.17 | Epsilon: 0.869 | Time: 7.33s
Episode 29 | Total Reward: -1283.22 | Avg(10): -1223.78 | Epsilon: 0.865 | Time: 7.72s
Episode 30 | Total Reward: -1169.19 | Avg(10): -1244.54 | Epsilon: 0.860 | Time: 7.31s
Episode 31 | Total Reward: -1542.74 | Avg(10): -1267.48 | Epsilon: 0.856 | Time: 7.25s
Episode 32 | Total Reward: -1340.81 | Avg(10): -1326.55 | Epsilon: 0.852 | Time: 7.29s
Episode 33 | Total Reward: -1227.92 | Avg(10): -1290.50 | Epsilon: 0.848 | Time: 7.38s
Episode 34 | Total Reward: -1304.81 | Avg(10): -1314.56 | Epsilon: 0.843 | Time: 7.05s
Episode 35 | Total Reward: -1703.46 | Avg(10): -1367.02 | Epsilon: 0.839 | Time: 6.99s
Episode 36 | Total Reward: -1001.70 | Avg(10): -1379.58 | Epsilon: 0.835 | Time: 7.23s
Episode 37 | Total Reward: -1453.79 | Avg(10): -1376.45 | Epsilon: 0.831 | Time: 7.37s
Episode 38 | Total Reward: -1509.50 | Avg(10): -1353.71 | Epsilon: 0.827 | Time: 7.22s
Episode 39 | Total Reward: -904.15 | Avg(10): -1315.81 | Epsilon: 0.822 | Time: 9.76s
Episode 40 | Total Reward: -1223.95 | Avg(10): -1321.28 | Epsilon: 0.818 | Time: 7.95s
Episode 41 | Total Reward: -1119.04 | Avg(10): -1278.91 | Epsilon: 0.814 | Time: 7.23s
Episode 42 | Total Reward: -1562.06 | Avg(10): -1301.04 | Epsilon: 0.810 | Time: 7.43s
Episode 43 | Total Reward: -1385.56 | Avg(10): -1316.80 | Epsilon: 0.806 | Time: 7.50s
Episode 44 | Total Reward: -852.62 | Avg(10): -1271.58 | Epsilon: 0.802 | Time: 7.62s
Episode 45 | Total Reward: -1666.15 | Avg(10): -1267.85 | Epsilon: 0.798 | Time: 7.44s
Episode 46 | Total Reward: -1220.19 | Avg(10): -1289.70 | Epsilon: 0.794 | Time: 7.40s
Episode 47 | Total Reward: -875.73 | Avg(10): -1231.89 | Epsilon: 0.790 | Time: 7.37s
Episode 48 | Total Reward: -893.25 | Avg(10): -1170.27 | Epsilon: 0.786 | Time: 7.37s
Episode 49 | Total Reward: -871.93 | Avg(10): -1167.05 | Epsilon: 0.782 | Time: 7.43s
Episode 50 | Total Reward: -1197.93 | Avg(10): -1164.44 | Epsilon: 0.778 | Time: 7.27s
Episode 51 | Total Reward: -1080.21 | Avg(10): -1160.56 | Epsilon: 0.774 | Time: 7.29s
Episode 52 | Total Reward: -1609.12 | Avg(10): -1165.27 | Epsilon: 0.771 | Time: 7.14s
Episode 53 | Total Reward: -980.73 | Avg(10): -1124.79 | Epsilon: 0.767 | Time: 7.11s
Episode 54 | Total Reward: -977.25 | Avg(10): -1137.25 | Epsilon: 0.763 | Time: 7.25s
Episode 55 | Total Reward: -1378.59 | Avg(10): -1108.49 | Epsilon: 0.759 | Time: 7.27s
Episode 56 | Total Reward: -1191.04 | Avg(10): -1105.58 | Epsilon: 0.755 | Time: 7.85s
Episode 57 | Total Reward: -1234.97 | Avg(10): -1141.50 | Epsilon: 0.751 | Time: 7.24s
Episode 58 | Total Reward: -1377.49 | Avg(10): -1189.93 | Epsilon: 0.748 | Time: 7.87s
Episode 59 | Total Reward: -1278.23 | Avg(10): -1230.56 | Epsilon: 0.744 | Time: 7.79s
Episode 60 | Total Reward: -1072.44 | Avg(10): -1218.01 | Epsilon: 0.740 | Time: 7.53s
Episode 61 | Total Reward: -970.23 | Avg(10): -1207.01 | Epsilon: 0.737 | Time: 8.88s
Episode 62 | Total Reward: -976.13 | Avg(10): -1143.71 | Epsilon: 0.733 | Time: 7.64s
Episode 63 | Total Reward: -1200.18 | Avg(10): -1165.66 | Epsilon: 0.729 | Time: 7.81s
Episode 64 | Total Reward: -912.16 | Avg(10): -1159.15 | Epsilon: 0.726 | Time: 7.58s
Episode 65 | Total Reward: -1094.00 | Avg(10): -1130.69 | Epsilon: 0.722 | Time: 7.68s
Episode 66 | Total Reward: -867.24 | Avg(10): -1098.31 | Epsilon: 0.718 | Time: 8.07s
Episode 67 | Total Reward: -1328.55 | Avg(10): -1107.67 | Epsilon: 0.715 | Time: 7.35s
Episode 68 | Total Reward: -766.90 | Avg(10): -1046.61 | Epsilon: 0.711 | Time: 7.24s
Episode 69 | Total Reward: -1535.35 | Avg(10): -1072.32 | Epsilon: 0.708 | Time: 7.04s
Episode 70 | Total Reward: -902.88 | Avg(10): -1055.36 | Epsilon: 0.704 | Time: 7.14s
Episode 71 | Total Reward: -1478.18 | Avg(10): -1106.16 | Epsilon: 0.701 | Time: 7.08s
Episode 72 | Total Reward: -981.54 | Avg(10): -1106.70 | Epsilon: 0.697 | Time: 7.08s
Episode 73 | Total Reward: -1026.96 | Avg(10): -1089.38 | Epsilon: 0.694 | Time: 7.29s
Episode 74 | Total Reward: -1267.75 | Avg(10): -1124.94 | Epsilon: 0.690 | Time: 7.34s
Episode 75 | Total Reward: -948.46 | Avg(10): -1110.38 | Epsilon: 0.687 | Time: 8.33s
Episode 76 | Total Reward: -1248.41 | Avg(10): -1148.50 | Epsilon: 0.683 | Time: 7.51s
Episode 77 | Total Reward: -1081.21 | Avg(10): -1123.77 | Epsilon: 0.680 | Time: 7.65s
Episode 78 | Total Reward: -1018.19 | Avg(10): -1148.90 | Epsilon: 0.676 | Time: 7.49s
Episode 79 | Total Reward: -899.24 | Avg(10): -1085.28 | Epsilon: 0.673 | Time: 7.55s
Episode 80 | Total Reward: -1103.02 | Avg(10): -1105.30 | Epsilon: 0.670 | Time: 7.68s
Episode 81 | Total Reward: -896.50 | Avg(10): -1047.13 | Epsilon: 0.666 | Time: 7.43s
Episode 82 | Total Reward: -767.64 | Avg(10): -1025.74 | Epsilon: 0.663 | Time: 7.28s
Episode 83 | Total Reward: -1118.51 | Avg(10): -1034.89 | Epsilon: 0.660 | Time: 7.75s
Episode 84 | Total Reward: -1155.19 | Avg(10): -1023.64 | Epsilon: 0.656 | Time: 7.30s
Episode 85 | Total Reward: -996.93 | Avg(10): -1028.49 | Epsilon: 0.653 | Time: 7.05s
Episode 86 | Total Reward: -871.52 | Avg(10): -990.80 | Epsilon: 0.650 | Time: 6.90s
Episode 87 | Total Reward: -1046.83 | Avg(10): -987.36 | Epsilon: 0.647 | Time: 6.90s
Episode 88 | Total Reward: -1098.45 | Avg(10): -995.38 | Epsilon: 0.643 | Time: 6.93s
Episode 89 | Total Reward: -1295.53 | Avg(10): -1035.01 | Epsilon: 0.640 | Time: 6.98s
Episode 90 | Total Reward: -1098.69 | Avg(10): -1034.58 | Epsilon: 0.637 | Time: 7.02s
Episode 91 | Total Reward: -1178.62 | Avg(10): -1062.79 | Epsilon: 0.634 | Time: 7.07s
Episode 92 | Total Reward: -1036.48 | Avg(10): -1089.68 | Epsilon: 0.631 | Time: 7.07s
Episode 93 | Total Reward: -1046.30 | Avg(10): -1082.45 | Epsilon: 0.627 | Time: 7.12s
Episode 94 | Total Reward: -1142.75 | Avg(10): -1081.21 | Epsilon: 0.624 | Time: 6.99s
Episode 95 | Total Reward: -1194.96 | Avg(10): -1101.01 | Epsilon: 0.621 | Time: 7.05s
Episode 96 | Total Reward: -822.91 | Avg(10): -1096.15 | Epsilon: 0.618 | Time: 7.05s
Episode 97 | Total Reward: -1022.75 | Avg(10): -1093.74 | Epsilon: 0.615 | Time: 7.18s
Episode 98 | Total Reward: -1027.33 | Avg(10): -1086.63 | Epsilon: 0.612 | Time: 7.23s
Episode 99 | Total Reward: -1129.94 | Avg(10): -1070.07 | Epsilon: 0.609 | Time: 6.95s
Episode 100 | Total Reward: -998.95 | Avg(10): -1060.10 | Epsilon: 0.606 | Time: 6.96s
Episode 101 | Total Reward: -1125.86 | Avg(10): -1054.82 | Epsilon: 0.603 | Time: 6.90s
Episode 102 | Total Reward: -1172.94 | Avg(10): -1068.47 | Epsilon: 0.600 | Time: 6.97s
Episode 103 | Total Reward: -1164.58 | Avg(10): -1080.30 | Epsilon: 0.597 | Time: 6.88s
Episode 104 | Total Reward: -1143.04 | Avg(10): -1080.33 | Epsilon: 0.594 | Time: 7.06s
Episode 105 | Total Reward: -1063.08 | Avg(10): -1067.14 | Epsilon: 0.591 | Time: 7.17s
Episode 106 | Total Reward: -1173.63 | Avg(10): -1102.21 | Epsilon: 0.588 | Time: 7.00s
Episode 107 | Total Reward: -918.86 | Avg(10): -1091.82 | Epsilon: 0.585 | Time: 7.18s
Episode 108 | Total Reward: -1137.13 | Avg(10): -1102.80 | Epsilon: 0.582 | Time: 7.09s
Episode 109 | Total Reward: -1043.07 | Avg(10): -1094.12 | Epsilon: 0.579 | Time: 7.17s
Episode 110 | Total Reward: -1090.53 | Avg(10): -1103.27 | Epsilon: 0.576 | Time: 7.02s
Episode 111 | Total Reward: -986.00 | Avg(10): -1089.29 | Epsilon: 0.573 | Time: 7.14s
Episode 112 | Total Reward: -1144.21 | Avg(10): -1086.41 | Epsilon: 0.570 | Time: 7.26s
Episode 113 | Total Reward: -1110.67 | Avg(10): -1081.02 | Epsilon: 0.568 | Time: 7.24s
Episode 114 | Total Reward: -1028.06 | Avg(10): -1069.52 | Epsilon: 0.565 | Time: 7.16s
Episode 115 | Total Reward: -1195.66 | Avg(10): -1082.78 | Epsilon: 0.562 | Time: 7.00s
Episode 116 | Total Reward: -1045.49 | Avg(10): -1069.97 | Epsilon: 0.559 | Time: 7.05s
Episode 117 | Total Reward: -863.62 | Avg(10): -1064.44 | Epsilon: 0.556 | Time: 7.14s
Episode 118 | Total Reward: -1026.46 | Avg(10): -1053.38 | Epsilon: 0.554 | Time: 7.07s
Episode 119 | Total Reward: -1121.95 | Avg(10): -1061.26 | Epsilon: 0.551 | Time: 7.07s
Episode 120 | Total Reward: -1137.39 | Avg(10): -1065.95 | Epsilon: 0.548 | Time: 7.00s
Episode 121 | Total Reward: -938.99 | Avg(10): -1061.25 | Epsilon: 0.545 | Time: 7.04s
Episode 122 | Total Reward: -996.89 | Avg(10): -1046.52 | Epsilon: 0.543 | Time: 6.99s
Episode 123 | Total Reward: -1084.20 | Avg(10): -1043.87 | Epsilon: 0.540 | Time: 7.09s
Episode 124 | Total Reward: -915.17 | Avg(10): -1032.58 | Epsilon: 0.537 | Time: 7.20s
Episode 125 | Total Reward: -939.52 | Avg(10): -1006.97 | Epsilon: 0.534 | Time: 7.14s
Episode 126 | Total Reward: -1220.66 | Avg(10): -1024.48 | Epsilon: 0.532 | Time: 7.38s
Episode 127 | Total Reward: -1025.33 | Avg(10): -1040.66 | Epsilon: 0.529 | Time: 7.35s
Episode 128 | Total Reward: -1349.09 | Avg(10): -1072.92 | Epsilon: 0.526 | Time: 7.22s
Episode 129 | Total Reward: -1130.55 | Avg(10): -1073.78 | Epsilon: 0.524 | Time: 7.21s
Episode 130 | Total Reward: -887.50 | Avg(10): -1048.79 | Epsilon: 0.521 | Time: 7.06s
Episode 131 | Total Reward: -1034.06 | Avg(10): -1058.30 | Epsilon: 0.519 | Time: 7.15s
Episode 132 | Total Reward: -1041.05 | Avg(10): -1062.71 | Epsilon: 0.516 | Time: 7.11s
Episode 133 | Total Reward: -1149.00 | Avg(10): -1069.19 | Epsilon: 0.513 | Time: 7.18s
Episode 134 | Total Reward: -891.67 | Avg(10): -1066.84 | Epsilon: 0.511 | Time: 7.39s
Episode 135 | Total Reward: -773.17 | Avg(10): -1050.21 | Epsilon: 0.508 | Time: 7.08s
Episode 136 | Total Reward: -1179.41 | Avg(10): -1046.08 | Epsilon: 0.506 | Time: 7.07s
Episode 137 | Total Reward: -1068.36 | Avg(10): -1050.39 | Epsilon: 0.503 | Time: 7.04s
Episode 138 | Total Reward: -899.05 | Avg(10): -1005.38 | Epsilon: 0.501 | Time: 7.07s
Episode 139 | Total Reward: -1250.47 | Avg(10): -1017.37 | Epsilon: 0.498 | Time: 7.09s
Episode 140 | Total Reward: -770.01 | Avg(10): -1005.63 | Epsilon: 0.496 | Time: 7.24s
Episode 141 | Total Reward: -1001.02 | Avg(10): -1002.32 | Epsilon: 0.493 | Time: 7.21s
Episode 142 | Total Reward: -952.21 | Avg(10): -993.44 | Epsilon: 0.491 | Time: 7.15s
Episode 143 | Total Reward: -996.53 | Avg(10): -978.19 | Epsilon: 0.488 | Time: 7.25s
Episode 144 | Total Reward: -870.74 | Avg(10): -976.10 | Epsilon: 0.486 | Time: 7.26s
Episode 145 | Total Reward: -933.12 | Avg(10): -992.09 | Epsilon: 0.483 | Time: 7.30s
Episode 146 | Total Reward: -732.86 | Avg(10): -947.44 | Epsilon: 0.481 | Time: 7.14s
Episode 147 | Total Reward: -882.00 | Avg(10): -928.80 | Epsilon: 0.479 | Time: 7.51s
Episode 148 | Total Reward: -886.30 | Avg(10): -927.53 | Epsilon: 0.476 | Time: 7.32s
Episode 149 | Total Reward: -1125.37 | Avg(10): -915.02 | Epsilon: 0.474 | Time: 7.12s
Episode 150 | Total Reward: -1033.84 | Avg(10): -941.40 | Epsilon: 0.471 | Time: 7.06s
Episode 151 | Total Reward: -894.47 | Avg(10): -930.74 | Epsilon: 0.469 | Time: 7.13s
Episode 152 | Total Reward: -999.32 | Avg(10): -935.45 | Epsilon: 0.467 | Time: 7.13s
Episode 153 | Total Reward: -871.96 | Avg(10): -923.00 | Epsilon: 0.464 | Time: 7.06s
Episode 154 | Total Reward: -828.17 | Avg(10): -918.74 | Epsilon: 0.462 | Time: 7.10s
Episode 155 | Total Reward: -740.71 | Avg(10): -899.50 | Epsilon: 0.460 | Time: 7.10s
NEW BEST at Episode 155: -899.50 (improved by 10.00)
Episode 156 | Total Reward: -752.76 | Avg(10): -901.49 | Epsilon: 0.458 | Time: 7.16s
Episode 157 | Total Reward: -1027.18 | Avg(10): -916.01 | Epsilon: 0.455 | Time: 7.21s
Episode 158 | Total Reward: -938.66 | Avg(10): -921.24 | Epsilon: 0.453 | Time: 7.18s
Episode 159 | Total Reward: -898.00 | Avg(10): -898.51 | Epsilon: 0.451 | Time: 7.27s
Episode 160 | Total Reward: -743.57 | Avg(10): -869.48 | Epsilon: 0.448 | Time: 7.26s
NEW BEST at Episode 160: -869.48 (improved by 10.00)
Episode 161 | Total Reward: -756.36 | Avg(10): -855.67 | Epsilon: 0.446 | Time: 7.32s
NEW BEST at Episode 161: -855.67 (improved by 10.00)
Episode 162 | Total Reward: -899.04 | Avg(10): -845.64 | Epsilon: 0.444 | Time: 7.27s
NEW BEST at Episode 162: -845.64 (improved by 10.00)
Episode 163 | Total Reward: -739.83 | Avg(10): -832.43 | Epsilon: 0.442 | Time: 7.25s
NEW BEST at Episode 163: -832.43 (improved by 10.00)
Episode 164 | Total Reward: -492.48 | Avg(10): -798.86 | Epsilon: 0.440 | Time: 7.37s
NEW BEST at Episode 164: -798.86 (improved by 10.00)
Episode 165 | Total Reward: -291.59 | Avg(10): -753.95 | Epsilon: 0.437 | Time: 7.38s
NEW BEST at Episode 165: -753.95 (improved by 10.00)
Episode 166 | Total Reward: -871.96 | Avg(10): -765.87 | Epsilon: 0.435 | Time: 7.21s
Episode 167 | Total Reward: -493.26 | Avg(10): -712.47 | Epsilon: 0.433 | Time: 7.37s
NEW BEST at Episode 167: -712.47 (improved by 10.00)
Episode 168 | Total Reward: -503.39 | Avg(10): -668.95 | Epsilon: 0.431 | Time: 7.39s
NEW BEST at Episode 168: -668.95 (improved by 10.00)
Episode 169 | Total Reward: -633.44 | Avg(10): -642.49 | Epsilon: 0.429 | Time: 7.22s
NEW BEST at Episode 169: -642.49 (improved by 10.00)
Episode 170 | Total Reward: -760.65 | Avg(10): -644.20 | Epsilon: 0.427 | Time: 7.23s
Episode 171 | Total Reward: -947.78 | Avg(10): -663.34 | Epsilon: 0.424 | Time: 7.32s
Episode 172 | Total Reward: -767.84 | Avg(10): -650.22 | Epsilon: 0.422 | Time: 7.19s
Episode 173 | Total Reward: -981.71 | Avg(10): -674.41 | Epsilon: 0.420 | Time: 7.17s
Episode 174 | Total Reward: -740.78 | Avg(10): -699.24 | Epsilon: 0.418 | Time: 7.23s
Episode 175 | Total Reward: -627.28 | Avg(10): -732.81 | Epsilon: 0.416 | Time: 7.31s
Episode 176 | Total Reward: -883.68 | Avg(10): -733.98 | Epsilon: 0.414 | Time: 7.37s
Episode 177 | Total Reward: -490.94 | Avg(10): -733.75 | Epsilon: 0.412 | Time: 7.50s
Episode 178 | Total Reward: -508.46 | Avg(10): -734.26 | Epsilon: 0.410 | Time: 7.25s
Episode 179 | Total Reward: -747.06 | Avg(10): -745.62 | Epsilon: 0.408 | Time: 7.32s
Episode 180 | Total Reward: -764.22 | Avg(10): -745.97 | Epsilon: 0.406 | Time: 7.37s
Episode 181 | Total Reward: -747.27 | Avg(10): -725.92 | Epsilon: 0.404 | Time: 7.31s
Episode 182 | Total Reward: -382.23 | Avg(10): -687.36 | Epsilon: 0.402 | Time: 7.32s
Episode 183 | Total Reward: -850.70 | Avg(10): -674.26 | Epsilon: 0.400 | Time: 7.26s
Episode 184 | Total Reward: -616.31 | Avg(10): -661.82 | Epsilon: 0.398 | Time: 7.17s
Episode 185 | Total Reward: -507.19 | Avg(10): -649.81 | Epsilon: 0.396 | Time: 7.24s
Episode 186 | Total Reward: -634.04 | Avg(10): -624.84 | Epsilon: 0.394 | Time: 7.74s
NEW BEST at Episode 186: -624.84 (improved by 10.00)
Episode 187 | Total Reward: -768.56 | Avg(10): -652.60 | Epsilon: 0.392 | Time: 7.92s
Episode 188 | Total Reward: -503.95 | Avg(10): -652.15 | Epsilon: 0.390 | Time: 7.32s
Episode 189 | Total Reward: -500.48 | Avg(10): -627.50 | Epsilon: 0.388 | Time: 7.18s
Episode 190 | Total Reward: -626.09 | Avg(10): -613.68 | Epsilon: 0.386 | Time: 7.29s
NEW BEST at Episode 190: -613.68 (improved by 10.00)
Episode 191 | Total Reward: -378.13 | Avg(10): -576.77 | Epsilon: 0.384 | Time: 7.31s
NEW BEST at Episode 191: -576.77 (improved by 10.00)
Episode 192 | Total Reward: -977.34 | Avg(10): -636.28 | Epsilon: 0.382 | Time: 7.28s
Episode 193 | Total Reward: -619.76 | Avg(10): -613.19 | Epsilon: 0.380 | Time: 7.38s
Episode 194 | Total Reward: -502.75 | Avg(10): -601.83 | Epsilon: 0.378 | Time: 7.31s
Episode 195 | Total Reward: -611.56 | Avg(10): -612.27 | Epsilon: 0.376 | Time: 7.31s
Episode 196 | Total Reward: -741.96 | Avg(10): -623.06 | Epsilon: 0.374 | Time: 7.22s
Episode 197 | Total Reward: -749.59 | Avg(10): -621.16 | Epsilon: 0.373 | Time: 7.28s
Episode 198 | Total Reward: -760.60 | Avg(10): -646.83 | Epsilon: 0.371 | Time: 7.20s
Episode 199 | Total Reward: -751.55 | Avg(10): -671.93 | Epsilon: 0.369 | Time: 7.15s
Episode 200 | Total Reward: -770.25 | Avg(10): -686.35 | Epsilon: 0.367 | Time: 7.31s
Episode 201 | Total Reward: -741.94 | Avg(10): -722.73 | Epsilon: 0.365 | Time: 7.14s
Episode 202 | Total Reward: -498.65 | Avg(10): -674.86 | Epsilon: 0.363 | Time: 7.19s
Episode 203 | Total Reward: -376.54 | Avg(10): -650.54 | Epsilon: 0.361 | Time: 7.11s
Episode 204 | Total Reward: -738.43 | Avg(10): -674.11 | Epsilon: 0.360 | Time: 7.07s
Episode 205 | Total Reward: -630.99 | Avg(10): -676.05 | Epsilon: 0.358 | Time: 7.12s
Episode 206 | Total Reward: -502.60 | Avg(10): -652.11 | Epsilon: 0.356 | Time: 7.25s
Episode 207 | Total Reward: -254.61 | Avg(10): -602.62 | Epsilon: 0.354 | Time: 7.40s
Episode 208 | Total Reward: -502.77 | Avg(10): -576.83 | Epsilon: 0.353 | Time: 7.27s
Episode 209 | Total Reward: -498.74 | Avg(10): -551.55 | Epsilon: 0.351 | Time: 7.33s
NEW BEST at Episode 209: -551.55 (improved by 10.00)
Episode 210 | Total Reward: -499.87 | Avg(10): -524.51 | Epsilon: 0.349 | Time: 7.29s
NEW BEST at Episode 210: -524.51 (improved by 10.00)
Episode 211 | Total Reward: -746.79 | Avg(10): -525.00 | Epsilon: 0.347 | Time: 7.17s
Episode 212 | Total Reward: -253.82 | Avg(10): -500.52 | Epsilon: 0.346 | Time: 7.23s
NEW BEST at Episode 212: -500.52 (improved by 10.00)
Episode 213 | Total Reward: -621.11 | Avg(10): -524.97 | Epsilon: 0.344 | Time: 7.46s
Episode 214 | Total Reward: -612.31 | Avg(10): -512.36 | Epsilon: 0.342 | Time: 7.79s
Episode 215 | Total Reward: -490.60 | Avg(10): -498.32 | Epsilon: 0.340 | Time: 7.30s
Episode 216 | Total Reward: -615.27 | Avg(10): -509.59 | Epsilon: 0.339 | Time: 7.13s
Episode 217 | Total Reward: -379.73 | Avg(10): -522.10 | Epsilon: 0.337 | Time: 7.13s
Episode 218 | Total Reward: -254.19 | Avg(10): -497.24 | Epsilon: 0.335 | Time: 7.14s
Episode 219 | Total Reward: -361.22 | Avg(10): -483.49 | Epsilon: 0.334 | Time: 7.18s
NEW BEST at Episode 219: -483.49 (improved by 10.00)
Episode 220 | Total Reward: -603.84 | Avg(10): -493.89 | Epsilon: 0.332 | Time: 7.07s
Episode 221 | Total Reward: -864.70 | Avg(10): -505.68 | Epsilon: 0.330 | Time: 7.13s
Episode 222 | Total Reward: -500.97 | Avg(10): -530.39 | Epsilon: 0.329 | Time: 7.28s
Episode 223 | Total Reward: -480.86 | Avg(10): -516.37 | Epsilon: 0.327 | Time: 7.21s
Episode 224 | Total Reward: -363.12 | Avg(10): -491.45 | Epsilon: 0.325 | Time: 7.24s
Episode 225 | Total Reward: -877.35 | Avg(10): -530.13 | Epsilon: 0.324 | Time: 7.29s
Episode 226 | Total Reward: -501.62 | Avg(10): -518.76 | Epsilon: 0.322 | Time: 7.27s
Episode 227 | Total Reward: -553.47 | Avg(10): -536.13 | Epsilon: 0.321 | Time: 7.31s
Episode 228 | Total Reward: -611.68 | Avg(10): -571.88 | Epsilon: 0.319 | Time: 7.35s
Episode 229 | Total Reward: -360.63 | Avg(10): -571.82 | Epsilon: 0.317 | Time: 7.42s
Episode 230 | Total Reward: -459.44 | Avg(10): -557.38 | Epsilon: 0.316 | Time: 7.35s
Episode 231 | Total Reward: -383.10 | Avg(10): -509.22 | Epsilon: 0.314 | Time: 7.29s
Episode 232 | Total Reward: -373.41 | Avg(10): -496.47 | Epsilon: 0.313 | Time: 7.12s
Episode 233 | Total Reward: -483.84 | Avg(10): -496.77 | Epsilon: 0.311 | Time: 7.08s
Episode 234 | Total Reward: -584.61 | Avg(10): -518.91 | Epsilon: 0.309 | Time: 7.19s
Episode 235 | Total Reward: -615.78 | Avg(10): -492.76 | Epsilon: 0.308 | Time: 7.22s
Episode 236 | Total Reward: -375.53 | Avg(10): -480.15 | Epsilon: 0.306 | Time: 7.21s
Episode 237 | Total Reward: -631.11 | Avg(10): -487.91 | Epsilon: 0.305 | Time: 7.21s
Episode 238 | Total Reward: -255.18 | Avg(10): -452.26 | Epsilon: 0.303 | Time: 7.20s
NEW BEST at Episode 238: -452.26 (improved by 10.00)
Episode 239 | Total Reward: -494.07 | Avg(10): -465.61 | Epsilon: 0.302 | Time: 7.20s
Episode 240 | Total Reward: -483.78 | Avg(10): -468.04 | Epsilon: 0.300 | Time: 7.27s
Episode 241 | Total Reward: -256.74 | Avg(10): -455.41 | Epsilon: 0.299 | Time: 7.30s
Episode 242 | Total Reward: -376.51 | Avg(10): -455.72 | Epsilon: 0.297 | Time: 7.38s
Episode 243 | Total Reward: -372.41 | Avg(10): -444.57 | Epsilon: 0.296 | Time: 7.25s
Episode 244 | Total Reward: -376.74 | Avg(10): -423.78 | Epsilon: 0.294 | Time: 7.20s
NEW BEST at Episode 244: -423.78 (improved by 10.00)
Episode 245 | Total Reward: -364.68 | Avg(10): -398.67 | Epsilon: 0.293 | Time: 7.29s
NEW BEST at Episode 245: -398.67 (improved by 10.00)
Episode 246 | Total Reward: -246.43 | Avg(10): -385.76 | Epsilon: 0.291 | Time: 7.25s
NEW BEST at Episode 246: -385.76 (improved by 10.00)
Episode 247 | Total Reward: -510.17 | Avg(10): -373.67 | Epsilon: 0.290 | Time: 7.26s
NEW BEST at Episode 247: -373.67 (improved by 10.00)
Episode 248 | Total Reward: -628.51 | Avg(10): -411.00 | Epsilon: 0.288 | Time: 7.14s
Episode 249 | Total Reward: -633.44 | Avg(10): -424.94 | Epsilon: 0.287 | Time: 7.20s
Episode 250 | Total Reward: -875.09 | Avg(10): -464.07 | Epsilon: 0.286 | Time: 7.13s
Episode 251 | Total Reward: -370.74 | Avg(10): -475.47 | Epsilon: 0.284 | Time: 7.09s
Episode 252 | Total Reward: -246.08 | Avg(10): -462.43 | Epsilon: 0.283 | Time: 7.27s
Episode 253 | Total Reward: -634.22 | Avg(10): -488.61 | Epsilon: 0.281 | Time: 7.41s
Episode 254 | Total Reward: -511.24 | Avg(10): -502.06 | Epsilon: 0.280 | Time: 7.28s
Episode 255 | Total Reward: -726.29 | Avg(10): -538.22 | Epsilon: 0.279 | Time: 7.20s
Episode 256 | Total Reward: -376.70 | Avg(10): -551.25 | Epsilon: 0.277 | Time: 7.39s
Episode 257 | Total Reward: -258.04 | Avg(10): -526.03 | Epsilon: 0.276 | Time: 7.38s
Episode 258 | Total Reward: -763.12 | Avg(10): -539.49 | Epsilon: 0.274 | Time: 7.31s
Episode 259 | Total Reward: -377.71 | Avg(10): -513.92 | Epsilon: 0.273 | Time: 7.41s
Episode 260 | Total Reward: -624.34 | Avg(10): -488.85 | Epsilon: 0.272 | Time: 7.44s
Episode 261 | Total Reward: -978.71 | Avg(10): -549.64 | Epsilon: 0.270 | Time: 7.34s
Episode 262 | Total Reward: -624.98 | Avg(10): -587.53 | Epsilon: 0.269 | Time: 7.41s
Episode 263 | Total Reward: -502.66 | Avg(10): -574.38 | Epsilon: 0.268 | Time: 7.33s
Episode 264 | Total Reward: -537.43 | Avg(10): -577.00 | Epsilon: 0.266 | Time: 7.34s
Episode 265 | Total Reward: -615.53 | Avg(10): -565.92 | Epsilon: 0.265 | Time: 7.40s
Episode 266 | Total Reward: -913.18 | Avg(10): -619.57 | Epsilon: 0.264 | Time: 7.25s
Episode 267 | Total Reward: -645.53 | Avg(10): -658.32 | Epsilon: 0.262 | Time: 7.27s
Episode 268 | Total Reward: -251.17 | Avg(10): -607.12 | Epsilon: 0.261 | Time: 7.38s
Episode 269 | Total Reward: -841.59 | Avg(10): -653.51 | Epsilon: 0.260 | Time: 7.19s
Episode 270 | Total Reward: -625.72 | Avg(10): -653.65 | Epsilon: 0.258 | Time: 7.38s
Episode 271 | Total Reward: -381.14 | Avg(10): -593.89 | Epsilon: 0.257 | Time: 8.37s
Episode 272 | Total Reward: -613.04 | Avg(10): -592.70 | Epsilon: 0.256 | Time: 7.70s
Episode 273 | Total Reward: -123.89 | Avg(10): -554.82 | Epsilon: 0.255 | Time: 7.56s
Episode 274 | Total Reward: -747.68 | Avg(10): -575.85 | Epsilon: 0.253 | Time: 8.11s
Episode 275 | Total Reward: -252.25 | Avg(10): -539.52 | Epsilon: 0.252 | Time: 7.40s
Episode 276 | Total Reward: -251.84 | Avg(10): -473.38 | Epsilon: 0.251 | Time: 7.29s
Episode 277 | Total Reward: -126.66 | Avg(10): -421.50 | Epsilon: 0.249 | Time: 7.39s
Episode 278 | Total Reward: -592.47 | Avg(10): -455.63 | Epsilon: 0.248 | Time: 7.39s
Episode 279 | Total Reward: -127.77 | Avg(10): -384.25 | Epsilon: 0.247 | Time: 7.28s
Episode 280 | Total Reward: -803.70 | Avg(10): -402.05 | Epsilon: 0.246 | Time: 7.30s
Episode 281 | Total Reward: -249.22 | Avg(10): -388.85 | Epsilon: 0.245 | Time: 7.25s
Episode 282 | Total Reward: -246.04 | Avg(10): -352.15 | Epsilon: 0.243 | Time: 7.28s
NEW BEST at Episode 282: -352.15 (improved by 10.00)
Episode 283 | Total Reward: -126.98 | Avg(10): -352.46 | Epsilon: 0.242 | Time: 7.18s
Episode 284 | Total Reward: -255.15 | Avg(10): -303.21 | Epsilon: 0.241 | Time: 7.26s
NEW BEST at Episode 284: -303.21 (improved by 10.00)
Episode 285 | Total Reward: -242.60 | Avg(10): -302.25 | Epsilon: 0.240 | Time: 7.20s
Episode 286 | Total Reward: -266.18 | Avg(10): -303.68 | Epsilon: 0.238 | Time: 7.14s
Episode 287 | Total Reward: -239.42 | Avg(10): -314.95 | Epsilon: 0.237 | Time: 7.15s
Episode 288 | Total Reward: -382.72 | Avg(10): -293.98 | Epsilon: 0.236 | Time: 7.21s
Episode 289 | Total Reward: -241.36 | Avg(10): -305.34 | Epsilon: 0.235 | Time: 7.20s
Episode 290 | Total Reward: -636.42 | Avg(10): -288.61 | Epsilon: 0.234 | Time: 7.32s
NEW BEST at Episode 290: -288.61 (improved by 10.00)
Episode 291 | Total Reward: -251.31 | Avg(10): -288.82 | Epsilon: 0.233 | Time: 7.35s
Episode 292 | Total Reward: -373.71 | Avg(10): -301.59 | Epsilon: 0.231 | Time: 7.33s
Episode 293 | Total Reward: -372.92 | Avg(10): -326.18 | Epsilon: 0.230 | Time: 7.38s
Episode 294 | Total Reward: -124.88 | Avg(10): -313.15 | Epsilon: 0.229 | Time: 7.44s
Episode 295 | Total Reward: -126.83 | Avg(10): -301.58 | Epsilon: 0.228 | Time: 7.37s
Episode 296 | Total Reward: -360.29 | Avg(10): -310.99 | Epsilon: 0.227 | Time: 7.27s
Episode 297 | Total Reward: -623.57 | Avg(10): -349.40 | Epsilon: 0.226 | Time: 7.29s
Episode 298 | Total Reward: -614.54 | Avg(10): -372.58 | Epsilon: 0.225 | Time: 7.82s
Episode 299 | Total Reward: -488.24 | Avg(10): -397.27 | Epsilon: 0.223 | Time: 7.51s
Episode 300 | Total Reward: -497.04 | Avg(10): -383.33 | Epsilon: 0.222 | Time: 7.28s
Episode 301 | Total Reward: -126.93 | Avg(10): -370.89 | Epsilon: 0.221 | Time: 7.25s
Episode 302 | Total Reward: -367.99 | Avg(10): -370.32 | Epsilon: 0.220 | Time: 7.22s
Episode 303 | Total Reward: -124.23 | Avg(10): -345.45 | Epsilon: 0.219 | Time: 7.26s
Episode 304 | Total Reward: -237.04 | Avg(10): -356.67 | Epsilon: 0.218 | Time: 7.20s
Episode 305 | Total Reward: -501.15 | Avg(10): -394.10 | Epsilon: 0.217 | Time: 7.51s
Episode 306 | Total Reward: -380.89 | Avg(10): -396.16 | Epsilon: 0.216 | Time: 7.34s
Episode 307 | Total Reward: -245.93 | Avg(10): -358.40 | Epsilon: 0.215 | Time: 7.35s
Episode 308 | Total Reward: -499.39 | Avg(10): -346.88 | Epsilon: 0.214 | Time: 7.31s
Episode 309 | Total Reward: -376.67 | Avg(10): -335.72 | Epsilon: 0.212 | Time: 7.41s
Episode 310 | Total Reward: -254.19 | Avg(10): -311.44 | Epsilon: 0.211 | Time: 7.48s
Episode 311 | Total Reward: -123.44 | Avg(10): -311.09 | Epsilon: 0.210 | Time: 7.50s
Episode 312 | Total Reward: -486.29 | Avg(10): -322.92 | Epsilon: 0.209 | Time: 7.39s
Episode 313 | Total Reward: -253.13 | Avg(10): -335.81 | Epsilon: 0.208 | Time: 7.24s
Episode 314 | Total Reward: -124.62 | Avg(10): -324.57 | Epsilon: 0.207 | Time: 7.18s
Episode 315 | Total Reward: -243.69 | Avg(10): -298.82 | Epsilon: 0.206 | Time: 7.16s
Episode 316 | Total Reward: -374.63 | Avg(10): -298.20 | Epsilon: 0.205 | Time: 7.19s
Episode 317 | Total Reward: -123.66 | Avg(10): -285.97 | Epsilon: 0.204 | Time: 7.17s
Episode 318 | Total Reward: -252.47 | Avg(10): -261.28 | Epsilon: 0.203 | Time: 7.15s
NEW BEST at Episode 318: -261.28 (improved by 10.00)
Episode 319 | Total Reward: -505.06 | Avg(10): -274.12 | Epsilon: 0.202 | Time: 7.24s
Episode 320 | Total Reward: -484.09 | Avg(10): -297.11 | Epsilon: 0.201 | Time: 7.30s
Episode 321 | Total Reward: -255.59 | Avg(10): -310.32 | Epsilon: 0.200 | Time: 7.31s
Episode 322 | Total Reward: -120.81 | Avg(10): -273.77 | Epsilon: 0.199 | Time: 7.33s
Episode 323 | Total Reward: -250.02 | Avg(10): -273.46 | Epsilon: 0.198 | Time: 7.29s
Episode 324 | Total Reward: -241.72 | Avg(10): -285.17 | Epsilon: 0.197 | Time: 7.35s
Episode 325 | Total Reward: -252.27 | Avg(10): -286.03 | Epsilon: 0.196 | Time: 7.37s
Episode 326 | Total Reward: -119.81 | Avg(10): -260.55 | Epsilon: 0.195 | Time: 7.32s
Episode 327 | Total Reward: -251.25 | Avg(10): -273.31 | Epsilon: 0.194 | Time: 7.38s
Episode 328 | Total Reward: -590.68 | Avg(10): -307.13 | Epsilon: 0.193 | Time: 8.43s
Episode 329 | Total Reward: -252.74 | Avg(10): -281.90 | Epsilon: 0.192 | Time: 7.35s
Episode 330 | Total Reward: -124.97 | Avg(10): -245.99 | Epsilon: 0.191 | Time: 7.29s
NEW BEST at Episode 330: -245.99 (improved by 10.00)
Episode 331 | Total Reward: -125.75 | Avg(10): -233.00 | Epsilon: 0.190 | Time: 7.31s
NEW BEST at Episode 331: -233.00 (improved by 10.00)
Episode 332 | Total Reward: -122.40 | Avg(10): -233.16 | Epsilon: 0.189 | Time: 7.45s
Episode 333 | Total Reward: -117.88 | Avg(10): -219.95 | Epsilon: 0.188 | Time: 7.56s
NEW BEST at Episode 333: -219.95 (improved by 10.00)
Episode 334 | Total Reward: -376.20 | Avg(10): -233.39 | Epsilon: 0.187 | Time: 7.41s
Episode 335 | Total Reward: -250.96 | Avg(10): -233.26 | Epsilon: 0.187 | Time: 7.37s
Episode 336 | Total Reward: -239.06 | Avg(10): -245.19 | Epsilon: 0.186 | Time: 7.29s
Episode 337 | Total Reward: -125.51 | Avg(10): -232.62 | Epsilon: 0.185 | Time: 7.40s
Episode 338 | Total Reward: -127.22 | Avg(10): -186.27 | Epsilon: 0.184 | Time: 7.36s
NEW BEST at Episode 338: -186.27 (improved by 10.00)
Episode 339 | Total Reward: -363.89 | Avg(10): -197.38 | Epsilon: 0.183 | Time: 7.42s
Episode 340 | Total Reward: -125.80 | Avg(10): -197.47 | Epsilon: 0.182 | Time: 7.47s
Episode 341 | Total Reward: -1.17 | Avg(10): -185.01 | Epsilon: 0.181 | Time: 7.47s
Episode 342 | Total Reward: -238.65 | Avg(10): -196.63 | Epsilon: 0.180 | Time: 7.39s
Episode 343 | Total Reward: -473.67 | Avg(10): -232.21 | Epsilon: 0.179 | Time: 7.43s
Episode 344 | Total Reward: -128.65 | Avg(10): -207.46 | Epsilon: 0.178 | Time: 7.35s
Episode 345 | Total Reward: -1.29 | Avg(10): -182.49 | Epsilon: 0.177 | Time: 7.41s
Episode 346 | Total Reward: -468.77 | Avg(10): -205.46 | Epsilon: 0.177 | Time: 7.36s
Episode 347 | Total Reward: -123.64 | Avg(10): -205.27 | Epsilon: 0.176 | Time: 7.23s
Episode 348 | Total Reward: -370.18 | Avg(10): -229.57 | Epsilon: 0.175 | Time: 7.29s
Episode 349 | Total Reward: -627.95 | Avg(10): -255.98 | Epsilon: 0.174 | Time: 7.32s
Episode 350 | Total Reward: -355.52 | Avg(10): -278.95 | Epsilon: 0.173 | Time: 7.23s
Episode 351 | Total Reward: -246.09 | Avg(10): -303.44 | Epsilon: 0.172 | Time: 7.19s
Episode 352 | Total Reward: -490.50 | Avg(10): -328.63 | Epsilon: 0.171 | Time: 7.38s
Episode 353 | Total Reward: -1.83 | Avg(10): -281.44 | Epsilon: 0.170 | Time: 7.31s
Episode 354 | Total Reward: -469.67 | Avg(10): -315.54 | Epsilon: 0.170 | Time: 7.46s
Episode 355 | Total Reward: -4.21 | Avg(10): -315.84 | Epsilon: 0.169 | Time: 7.44s
Episode 356 | Total Reward: -380.41 | Avg(10): -307.00 | Epsilon: 0.168 | Time: 7.46s
Episode 357 | Total Reward: -123.78 | Avg(10): -307.01 | Epsilon: 0.167 | Time: 7.48s
Episode 358 | Total Reward: -373.21 | Avg(10): -307.32 | Epsilon: 0.166 | Time: 7.40s
Episode 359 | Total Reward: -234.61 | Avg(10): -267.98 | Epsilon: 0.165 | Time: 7.45s
Episode 360 | Total Reward: -124.58 | Avg(10): -244.89 | Epsilon: 0.165 | Time: 7.39s
Episode 361 | Total Reward: -250.83 | Avg(10): -245.36 | Epsilon: 0.164 | Time: 7.34s
Episode 362 | Total Reward: -1.19 | Avg(10): -196.43 | Epsilon: 0.163 | Time: 7.48s
Episode 363 | Total Reward: -120.34 | Avg(10): -208.28 | Epsilon: 0.162 | Time: 7.50s
Episode 364 | Total Reward: -243.19 | Avg(10): -185.64 | Epsilon: 0.161 | Time: 7.25s
Episode 365 | Total Reward: -123.29 | Avg(10): -197.54 | Epsilon: 0.160 | Time: 7.24s
Episode 366 | Total Reward: -1.73 | Avg(10): -159.68 | Epsilon: 0.160 | Time: 7.31s
NEW BEST at Episode 366: -159.68 (improved by 10.00)
Episode 367 | Total Reward: -247.16 | Avg(10): -172.01 | Epsilon: 0.159 | Time: 7.26s
Episode 368 | Total Reward: -121.15 | Avg(10): -146.81 | Epsilon: 0.158 | Time: 7.16s
NEW BEST at Episode 368: -146.81 (improved by 10.00)
Episode 369 | Total Reward: -126.01 | Avg(10): -135.95 | Epsilon: 0.157 | Time: 7.16s
NEW BEST at Episode 369: -135.95 (improved by 10.00)
Episode 370 | Total Reward: -362.35 | Avg(10): -159.72 | Epsilon: 0.157 | Time: 7.44s
Episode 371 | Total Reward: -125.88 | Avg(10): -147.23 | Epsilon: 0.156 | Time: 7.29s
Episode 372 | Total Reward: -126.57 | Avg(10): -159.77 | Epsilon: 0.155 | Time: 7.32s
Episode 373 | Total Reward: -125.03 | Avg(10): -160.24 | Epsilon: 0.154 | Time: 7.42s
Episode 374 | Total Reward: -361.00 | Avg(10): -172.02 | Epsilon: 0.153 | Time: 7.31s
Episode 375 | Total Reward: -119.64 | Avg(10): -171.65 | Epsilon: 0.153 | Time: 7.30s
Episode 376 | Total Reward: -125.20 | Avg(10): -184.00 | Epsilon: 0.152 | Time: 7.33s
Episode 377 | Total Reward: -372.69 | Avg(10): -196.55 | Epsilon: 0.151 | Time: 7.35s
Episode 378 | Total Reward: -507.38 | Avg(10): -235.18 | Epsilon: 0.150 | Time: 7.26s
Episode 379 | Total Reward: -244.88 | Avg(10): -247.06 | Epsilon: 0.150 | Time: 7.16s
Episode 380 | Total Reward: -2.21 | Avg(10): -211.05 | Epsilon: 0.149 | Time: 7.27s
Episode 381 | Total Reward: -122.18 | Avg(10): -210.68 | Epsilon: 0.148 | Time: 7.22s
Episode 382 | Total Reward: -247.62 | Avg(10): -222.78 | Epsilon: 0.147 | Time: 7.26s
Episode 383 | Total Reward: -362.20 | Avg(10): -246.50 | Epsilon: 0.147 | Time: 7.30s
Episode 384 | Total Reward: -242.38 | Avg(10): -234.64 | Epsilon: 0.146 | Time: 7.23s
Episode 385 | Total Reward: -377.65 | Avg(10): -260.44 | Epsilon: 0.145 | Time: 7.17s
Episode 386 | Total Reward: -233.24 | Avg(10): -271.24 | Epsilon: 0.144 | Time: 7.23s
Episode 387 | Total Reward: -128.28 | Avg(10): -246.80 | Epsilon: 0.144 | Time: 7.41s
Episode 388 | Total Reward: -120.49 | Avg(10): -208.11 | Epsilon: 0.143 | Time: 7.33s
Episode 389 | Total Reward: -120.40 | Avg(10): -195.67 | Epsilon: 0.142 | Time: 7.40s
Episode 390 | Total Reward: -250.30 | Avg(10): -220.47 | Epsilon: 0.142 | Time: 7.38s
Episode 391 | Total Reward: -390.33 | Avg(10): -247.29 | Epsilon: 0.141 | Time: 7.47s
Episode 392 | Total Reward: -497.60 | Avg(10): -272.29 | Epsilon: 0.140 | Time: 7.46s
Episode 393 | Total Reward: -128.06 | Avg(10): -248.87 | Epsilon: 0.139 | Time: 7.38s
Episode 394 | Total Reward: -366.58 | Avg(10): -261.29 | Epsilon: 0.139 | Time: 7.47s
Episode 395 | Total Reward: -340.77 | Avg(10): -257.60 | Epsilon: 0.138 | Time: 7.23s
Episode 396 | Total Reward: -365.76 | Avg(10): -270.86 | Epsilon: 0.137 | Time: 7.26s
Episode 397 | Total Reward: -249.35 | Avg(10): -282.96 | Epsilon: 0.137 | Time: 7.50s
Episode 398 | Total Reward: -535.68 | Avg(10): -324.48 | Epsilon: 0.136 | Time: 7.35s
Episode 399 | Total Reward: -238.15 | Avg(10): -336.26 | Epsilon: 0.135 | Time: 7.29s
Episode 400 | Total Reward: -116.65 | Avg(10): -322.89 | Epsilon: 0.135 | Time: 7.26s
Episode 401 | Total Reward: -239.69 | Avg(10): -307.83 | Epsilon: 0.134 | Time: 7.27s
Episode 402 | Total Reward: -373.85 | Avg(10): -295.45 | Epsilon: 0.133 | Time: 7.21s
Episode 403 | Total Reward: -1.17 | Avg(10): -282.76 | Epsilon: 0.133 | Time: 7.41s
Episode 404 | Total Reward: -1.36 | Avg(10): -246.24 | Epsilon: 0.132 | Time: 7.42s
Episode 405 | Total Reward: -125.45 | Avg(10): -224.71 | Epsilon: 0.131 | Time: 7.42s
Episode 406 | Total Reward: -233.04 | Avg(10): -211.44 | Epsilon: 0.131 | Time: 7.40s
Episode 407 | Total Reward: -478.07 | Avg(10): -234.31 | Epsilon: 0.130 | Time: 7.40s
Episode 408 | Total Reward: -118.10 | Avg(10): -192.55 | Epsilon: 0.129 | Time: 7.46s
Episode 409 | Total Reward: -123.45 | Avg(10): -181.08 | Epsilon: 0.129 | Time: 7.44s
Episode 410 | Total Reward: -354.49 | Avg(10): -204.87 | Epsilon: 0.128 | Time: 7.37s
Episode 411 | Total Reward: -1.42 | Avg(10): -181.04 | Epsilon: 0.127 | Time: 7.32s
Episode 412 | Total Reward: -121.80 | Avg(10): -155.83 | Epsilon: 0.127 | Time: 7.32s
Episode 413 | Total Reward: -127.59 | Avg(10): -168.48 | Epsilon: 0.126 | Time: 7.44s
Episode 414 | Total Reward: -3.07 | Avg(10): -168.65 | Epsilon: 0.126 | Time: 7.28s
Episode 415 | Total Reward: -1.92 | Avg(10): -156.29 | Epsilon: 0.125 | Time: 7.28s
Episode 416 | Total Reward: -498.35 | Avg(10): -182.83 | Epsilon: 0.124 | Time: 7.38s
Episode 417 | Total Reward: -238.43 | Avg(10): -158.86 | Epsilon: 0.124 | Time: 7.31s
Episode 418 | Total Reward: -354.65 | Avg(10): -182.52 | Epsilon: 0.123 | Time: 7.25s
Episode 419 | Total Reward: -125.63 | Avg(10): -182.73 | Epsilon: 0.122 | Time: 7.37s

EARLY STOPPING at Episode 419
No improvement for 50 episodes
Last improvement at episode: 369
No description has been provided for this image
TRAINING COMPLETED
Episodes trained: 419
Convergence episode: 369
Best average reward over 10 episodes: -135.95
Best model weights saved to: 21act_early_stopping_weights.h5
Total training time: 3028.19s
Time per episode: 7.23s

Evaluating trained model...
Test Episode 1: Total Reward = -372.21
Test Episode 2: Total Reward = -124.35
Test Episode 3: Total Reward = -236.30
Test Episode 4: Total Reward = -249.07
Test Episode 5: Total Reward = -250.17
Test Episode 6: Total Reward = -124.90
Test Episode 7: Total Reward = -396.45
Test Episode 8: Total Reward = -118.87
Test Episode 9: Total Reward = -124.78
Test Episode 10: Total Reward = -125.52

Average Reward over 10 episodes: -212.26 ± 101.06
No description has been provided for this image
============================================================
EARLY STOPPING EXPERIMENT RESULTS:
Training stopped at episode: 419
Expected time savings vs 600ep: 30.2%
Performance: -212.26 ± 101.06
============================================================
In [22]:
def evaluate_early_stopping_epsilon_zero_robust(experiment_prefix, n_actions, num_episodes=20, num_runs=5):
    """Robust evaluation for early stopping models with epsilon=0"""
    
    # Same parameters as training
    INPUT_SHAPE = 3
    GAMMA = 0.99
    REPLAY_MEMORY_SIZE = 50000
    MIN_REPLAY_MEMORY = 1000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    MAX_STEPS = 200
    
    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"
    
    # Recreate agent
    agent = DQNAgent(INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, 
                    MIN_REPLAY_MEMORY, BATCH_SIZE, TARGET_UPDATE_EVERY, 
                    LEARNING_RATE, EPSILON_START, EPSILON_MIN, EPSILON_DECAY)
    
    try:
        agent.load(SAVE_WEIGHTS_PATH)
        agent.epsilon = 0.0  # Critical: set epsilon to 0 for evaluation
    except FileNotFoundError:
        print(f"Warning: Weights file {SAVE_WEIGHTS_PATH} not found")
        return None
    
    print(f"\nRobust Evaluation: {experiment_prefix} with epsilon=0.0")
    print(f"Running {num_runs} evaluation sessions of {num_episodes} episodes each")
    
    all_run_results = []
    
    for run in range(num_runs):
        print(f"--- Run {run+1}/{num_runs} ---")
        env = gym.make('Pendulum-v0')
        
        run_rewards = []
        
        for ep in range(num_episodes):
            s = env.reset()
            s = s if isinstance(s, np.ndarray) else s[0]
            total_reward = 0
            
            for t in range(MAX_STEPS):
                a_idx = agent.select_action(s)  # epsilon=0, so purely greedy
                torque = action_index_to_torque(a_idx, n_actions)
                s_next, r, done, info = env.step(torque)
                s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
                total_reward += r
                s = s_next
                if done:
                    break
            
            run_rewards.append(total_reward)
        
        env.close()
        
        run_mean = np.mean(run_rewards)
        run_std = np.std(run_rewards)
        all_run_results.append({
            'mean': run_mean,
            'std': run_std,
            'rewards': run_rewards
        })
        
        print(f"Run {run+1}: {run_mean:.1f} ± {run_std:.1f}")
    
    # Overall statistics
    all_means = [run['mean'] for run in all_run_results]
    overall_mean = np.mean(all_means)
    overall_std = np.std(all_means)
    
    # Confidence interval
    confidence_level = 0.95
    dof = len(all_means) - 1
    t_critical = stats.t.ppf((1 + confidence_level) / 2, dof)
    margin_of_error = t_critical * (overall_std / np.sqrt(len(all_means)))
    ci_lower = overall_mean - margin_of_error
    ci_upper = overall_mean + margin_of_error
    
    print(f"\n--- EVALUATION SUMMARY ---")
    print(f"Overall mean: {overall_mean:.2f}")
    print(f"Run-to-run std: {overall_std:.2f}")
    print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
    
    return {
        'overall_mean': overall_mean,
        'overall_std': overall_std,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'run_means': all_means
    }
In [27]:
def compare_early_stopping_vs_baseline_simple():
    """Simple comparison: early stopping vs baseline only"""
    
    print("="*60)
    print("EARLY STOPPING VS BASELINE COMPARISON")
    print("="*60)
    
    n_actions = 21
    
    # Evaluate early stopping
    print("\n1. Early Stopping Model:")
    early_result = evaluate_early_stopping_epsilon_zero_robust("21act_early_stopping", n_actions)
    
    # Evaluate baseline
    print("\n2. Baseline 600-Episode Model:")
    baseline_result = evaluate_early_stopping_epsilon_zero_robust("21act_600ep_extended", n_actions)
    
    # Compare
    if early_result and baseline_result:
        print(f"\n{'='*60}")
        print("COMPARISON RESULTS:")
        print(f"Early Stopping: {early_result['overall_mean']:.2f} ± {early_result['overall_std']:.2f}")
        print(f"                CI: [{early_result['ci_lower']:.2f}, {early_result['ci_upper']:.2f}]")
        print(f"Baseline 600ep: {baseline_result['overall_mean']:.2f} ± {baseline_result['overall_std']:.2f}")
        print(f"                CI: [{baseline_result['ci_lower']:.2f}, {baseline_result['ci_upper']:.2f}]")
        
        diff = early_result['overall_mean'] - baseline_result['overall_mean']
        print(f"\nPerformance Difference: {diff:+.2f}")
        
        # Check statistical significance
        ci_overlap = not (baseline_result['ci_upper'] < early_result['ci_lower'] or 
                         early_result['ci_upper'] < baseline_result['ci_lower'])
        significance = "NOT significant" if ci_overlap else "SIGNIFICANT"
        print(f"Statistical Significance: {significance}")
        
        # Performance retention
        retention = (early_result['overall_mean'] / baseline_result['overall_mean']) * 100
        print(f"Performance Retention: {retention:.1f}%")
        
        return {
            'early_stopping': early_result,
            'baseline': baseline_result,
            'difference': diff,
            'significant': not ci_overlap,
            'retention_pct': retention
        }
    
    return None
In [28]:
if __name__ == "__main__":
    # comparison focused on whether it is worth early stopping
    results = compare_early_stopping_vs_baseline_simple()
    
    if results:
        # Save results
        with open("early_stopping_comparison.json", "w") as f:
            json.dump(results, f, indent=2)
        print(f"\nResults saved to 'early_stopping_comparison.json'")
============================================================
EARLY STOPPING VS BASELINE COMPARISON
============================================================

1. Early Stopping Model:

Robust Evaluation: 21act_early_stopping with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -159.7 ± 58.2
--- Run 2/5 ---
Run 2: -162.5 ± 96.1
--- Run 3/5 ---
Run 3: -195.0 ± 101.9
--- Run 4/5 ---
Run 4: -193.6 ± 117.9
--- Run 5/5 ---
Run 5: -153.2 ± 113.8

--- EVALUATION SUMMARY ---
Overall mean: -172.80
Run-to-run std: 17.84
95% CI: [-194.95, -150.66]

2. Baseline 600-Episode Model:

Robust Evaluation: 21act_600ep_extended with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -218.3 ± 99.1
--- Run 2/5 ---
Run 2: -185.6 ± 111.8
--- Run 3/5 ---
Run 3: -168.8 ± 93.3
--- Run 4/5 ---
Run 4: -143.8 ± 93.5
--- Run 5/5 ---
Run 5: -138.8 ± 110.2

--- EVALUATION SUMMARY ---
Overall mean: -171.05
Run-to-run std: 29.10
95% CI: [-207.19, -134.91]

============================================================
COMPARISON RESULTS:
Early Stopping: -172.80 ± 17.84
                CI: [-194.95, -150.66]
Baseline 600ep: -171.05 ± 29.10
                CI: [-207.19, -134.91]

Performance Difference: -1.75
Statistical Significance: NOT significant
Performance Retention: 101.0%

Results saved to 'early_stopping_comparison.json'

Early Stopping Analysis & Observations

  • Key Results Summary
    • Training stopped at: Episode 419 (vs max 500)
    • Convergence detected at: Episode 369
    • Time savings: 30.2% compared to 600 episodes
    • Final performance: -172.80 ± 17.84 (robust evaluation with epsilon=0)
    • Training best: -135.95 (10-episode average)
    • Statistical significance: NOT significant vs 600-episode baseline
Experiment Episodes Training Best Eval Performance (Robust) Time (min) Efficiency Statistical Significance
21act_600ep_extended 600 -96.88 -171.05 ± 29.10 63.0 Baseline -
21act_early_stopping 419 -135.95 -172.80 ± 17.84 50.5 30.2% faster NOT significant

Successes:

  • Correctly identified convergence point around episode 369
  • Saved significant training time (30.2%) - 81 fewer episodes
  • Prevented overtraining beyond convergence point
  • Maintained equivalent performance - only 1.75 point difference (statistically insignificant)
  • Improved stability - lower run-to-run variance (17.84 vs 29.10)

Trade-offs:

  • minimal performance difference (-1.75 points, well within statistical noise)
  • Confidence intervals overlap (-194.95, -150.66) vs (-207.19, -134.91), confirming no significant difference
  • Performance retention: 101.0% - essentially identical performance for 30% less training time

Key Finding: Early stopping successfully maintained performance while saving training time - the difference between -172.80 and -

To use or not use early stopping? ¶

I would not use early stopping eventhough it seems successfull above but the truth is it causes issues:

  1. Premature Stopping Risk: Early stopping at episode 419 might have prevented the model from reaching the true optimum that was achieved at episode 600

  2. Conservative Approach: In research/production, it's often safer to train longer to ensure convergence rather than risk stopping too early

In [52]:
def generate_21ES():
    input_shape = 3
    
    # 21-action Early stopping
    experiments = [
        {
            "name": "21act_early_stopping", 
            "n_actions": 21,
            "checkpoints": [100, 200, 300, 400]
        }
    ]
    
    for exp in experiments:
        for ep in exp['checkpoints']:
            weights_path = f"{exp['name']}_{ep}_weights.h5"
            gif_path = f"{exp['name']}_ep{ep:03d}.gif"
            
            if os.path.exists(weights_path):
                print(f"Generating GIF for: {weights_path}")
                try:
                    visualize_checkpoint(
                        weights_path=weights_path,
                        n_actions=exp['n_actions'], 
                        gif_path=gif_path,
                        input_shape=input_shape
                    )
                except Exception as e:
                    print(f"Failed at {weights_path}: {e}")
            else:
                print(f"File not found: {weights_path}")
In [53]:
generate_21ES()
Generating GIF for: 21act_early_stopping_100_weights.h5
Saved GIF to 21act_early_stopping_ep100.gif (Total reward: -1188.12)
Generating GIF for: 21act_early_stopping_200_weights.h5
Saved GIF to 21act_early_stopping_ep200.gif (Total reward: -260.83)
Generating GIF for: 21act_early_stopping_300_weights.h5
Saved GIF to 21act_early_stopping_ep300.gif (Total reward: -123.51)
Generating GIF for: 21act_early_stopping_400_weights.h5
Saved GIF to 21act_early_stopping_ep400.gif (Total reward: -1.82)

Epsilon Exploration¶

Epsilon (ε) exploration is the exploration-exploitation trade-off mechanism in DQN that determines when the agent should:

  • Exploit: Use its current knowledge (choose best Q-value action)

  • Explore: Try random actions to discover potentially better strategies

Why it matters?

Our current results still shows high variance (±101.06) and bimodal performance distribution. This suggests:

  • Poor exploration: Agent didn't discover all good action sequences
  • Premature exploitation: Converged to suboptimal policies too quickly
  • Action space complexity: 21 discrete actions need more sophisticated exploration

Exploration Strategies to Test:

  1. Linear (Baseline):
  • Standard fixed decay rate (current approach)
  • epsilon = max(epsilon_min, epsilon * 0.995)
  1. Performance-Based (Adaptive Decay):
  • Adjust epsilon based on learning progress
  • Slower decay when improving, faster decay when stagnating
  1. Plateau Restart:
  • Reset epsilon during convergence plateaus
  • Boosts exploration when no improvement for 20+ episodes
  1. High Exploration:
  • Maintain higher exploration throughout training
  • Higher minimum epsilon (0.15 vs 0.05) and slower decay

Goal: Find the epsilon strategy that gives me:¶

  • Lower variance
  • Better mean performance
  • Maintained efficiency

YET TO SOLVE THE PLATEAU in extended training experiment¶

Recall about what happened in extended training experiment

  • If we recall during the extended training experiments, it revealed that the convergence occurred around episodes 250-300, with the agent reaching a plateau around -200 reward. This showed that the bottleneck wasn't insufficient training time, but rather inefficient exploration strategy.

  • My basic DQNAgent with linear epsilon decay was following a rigid time-based schedule that became counterproductive once the agent reached a certain performance level.

  • The steep improvement from episodes 0-250 showed the agent was learning rapidly, but the subsequent plateau indicated that traditional epsilon decay was no longer facilitating meaningful exploration of the action space.

Thus instead of stopping training early, I thought of making training more efficient by addressing the root cause through algorithmic improvment (advancedDQNAgent)

Further improvements

  1. Selective Episode Printing
  • Episodes 1-10: Always shown
  • Every 25th episode + key milestones (50, 100, 150, etc.)
  • Reduces output clutter while maintaining progress visibility
  1. AvancedDQNAgent
  • Implements intelligent stagnation detection that automatically restarts exploration when the agent gets trapped in local optima, directly addressing the convergence issues observed in extended training experiments around episodes 250-300

  • Provides systematic framework for testing four distinct exploration strategies (linear, performance-based, plateau restart, high exploration) rather than being locked into a single approach, enabling data-driven selection of optimal exploration methods

  • The plateau restart mechanism specifically targets the stagnation behavior identified in Phase 2, providing strategic exploration boosts that can break through performance barriers that linear decay cannot overcome

In [66]:
class AdvancedDQNAgent(DQNAgent):
    def __init__(self, input_shape, n_actions, gamma, replay_memory_size, min_replay_memory, 
                 batch_size, target_update_every, learning_rate, epsilon_start, epsilon_min, 
                 epsilon_decay, epsilon_strategy="linear"):
        
        super().__init__(input_shape, n_actions, gamma, replay_memory_size, min_replay_memory,
                        batch_size, target_update_every, learning_rate, epsilon_start, 
                        epsilon_min, epsilon_decay)
        
        self.epsilon_strategy = epsilon_strategy
        self.epsilon_start = epsilon_start
        self.performance_history = deque(maxlen=50)
        self.last_improvement_episode = 0
        self.plateau_threshold = 20
        
    def adaptive_epsilon_decay(self, episode, recent_performance):
        """Adaptive epsilon based on learning progress"""
        
        if self.epsilon_strategy == "linear":
            # Standard linear decay
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
            
        elif self.epsilon_strategy == "performance_based":
            # Slower decay when performance is improving
            self.performance_history.append(recent_performance)
            
            if len(self.performance_history) >= 20:
                recent_avg = np.mean(list(self.performance_history)[-10:])
                older_avg = np.mean(list(self.performance_history)[-20:-10])
                
                if recent_avg > older_avg + 5:  # Improving
                    decay_rate = 0.998  # Slower decay
                    self.last_improvement_episode = episode
                else:  # Stagnating
                    decay_rate = 0.992  # Faster decay
                    
                return max(self.epsilon_min, self.epsilon * decay_rate)
            else:
                return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
                
        elif self.epsilon_strategy == "plateau_restart":
            # Restart epsilon when stuck in plateau
            self.performance_history.append(recent_performance)
            
            if len(self.performance_history) >= 20:
                recent_avg = np.mean(list(self.performance_history)[-10:])
                older_avg = np.mean(list(self.performance_history)[-20:-10])
                
                if recent_avg > older_avg + 5:
                    self.last_improvement_episode = episode
                
                # Check for plateau
                episodes_since_improvement = episode - self.last_improvement_episode
                if episodes_since_improvement >= self.plateau_threshold:
                    print(f"Epsilon restart at episode {episode}: {self.epsilon:.3f} → {self.epsilon_start * 0.3:.3f}")
                    self.epsilon = self.epsilon_start * 0.3  # Restart at 30% of initial
                    self.last_improvement_episode = episode
                    return self.epsilon
                    
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
            
        elif self.epsilon_strategy == "high_exploration":
            # Maintain higher minimum epsilon for continued exploration
            epsilon_min_high = 0.15  # Instead of 0.05
            return max(epsilon_min_high, self.epsilon * 0.9995)  # Slower decay
            
        else:
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
    
    def decay_epsilon_advanced(self, episode, recent_performance):
        """Advanced epsilon decay with strategy-specific logic"""
        self.epsilon = self.adaptive_epsilon_decay(episode, recent_performance)
In [9]:
def evaluate_epsilon_zero_robust_strategy(experiment_prefix, n_actions, num_episodes=20, num_runs=5):
    """Robust evaluation with multiple runs for epsilon strategy experiments"""
    
    INPUT_SHAPE = 3
    GAMMA = 0.99
    REPLAY_MEMORY_SIZE = 50000
    MIN_REPLAY_MEMORY = 1000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    MAX_STEPS = 200
    
    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"
    
    # Recreate agent (using base DQNAgent for evaluation)
    agent = DQNAgent(INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, 
                    MIN_REPLAY_MEMORY, BATCH_SIZE, TARGET_UPDATE_EVERY, 
                    LEARNING_RATE, EPSILON_START, EPSILON_MIN, EPSILON_DECAY)
    
    try:
        agent.load(SAVE_WEIGHTS_PATH)
        agent.epsilon = 0.0  # Force pure exploitation
    except FileNotFoundError:
        print(f"Warning: Weights file {SAVE_WEIGHTS_PATH} not found")
        return None
    
    print(f"\nRobust Evaluation: {experiment_prefix} with epsilon=0.0")
    print(f"Running {num_runs} evaluation sessions of {num_episodes} episodes each")
    
    all_run_results = []
    
    for run in range(num_runs):
        print(f"--- Run {run+1}/{num_runs} ---")
        env = gym.make('Pendulum-v0')
        
        run_rewards = []
        
        for ep in range(num_episodes):
            s = env.reset()
            s = s if isinstance(s, np.ndarray) else s[0]
            total_reward = 0
            
            for t in range(MAX_STEPS):
                a_idx = agent.select_action(s)
                torque = action_index_to_torque(a_idx, n_actions)
                s_next, r, done, info = env.step(torque)
                s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
                total_reward += r
                s = s_next
                if done:
                    break
            
            run_rewards.append(total_reward)
        
        env.close()
        
        run_mean = np.mean(run_rewards)
        run_std = np.std(run_rewards)
        all_run_results.append({
            'mean': run_mean,
            'std': run_std,
            'rewards': run_rewards
        })
        
        print(f"Run {run+1}: {run_mean:.1f} ± {run_std:.1f}")
    
    # Overall statistics
    all_means = [run['mean'] for run in all_run_results]
    overall_mean = np.mean(all_means)
    overall_std = np.std(all_means)
    
    # All individual episode rewards
    all_rewards = []
    for run in all_run_results:
        all_rewards.extend(run['rewards'])
    
    # Confidence interval
    confidence_level = 0.95
    dof = len(all_means) - 1
    t_critical = stats.t.ppf((1 + confidence_level) / 2, dof)
    margin_of_error = t_critical * (overall_std / np.sqrt(len(all_means)))
    ci_lower = overall_mean - margin_of_error
    ci_upper = overall_mean + margin_of_error
    
    print(f"\n--- ROBUST EVALUATION SUMMARY ---")
    print(f"Overall mean: {overall_mean:.2f}")
    print(f"Run-to-run std: {overall_std:.2f}")
    print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
    print("-" * 50)
    
    return {
        'overall_mean': overall_mean,
        'overall_std': overall_std,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'run_means': all_means,
        'all_rewards': all_rewards,
        'num_runs': num_runs,
        'num_episodes': num_episodes
    }
In [10]:
def train_epsilon_exploration_experiment(n_actions, epsilon_strategy, experiment_prefix):
    """Train with advanced epsilon exploration strategies - NO EARLY STOPPING"""
    
    ENV_NAME = 'Pendulum-v0'
    INPUT_SHAPE = 3
    GAMMA = 0.99
    REPLAY_MEMORY_SIZE = 50000
    MIN_REPLAY_MEMORY = 1000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    
    # FIXED 600 EPISODES - NO EARLY STOPPING
    MAX_EPISODES = 600
    MAX_STEPS = 200

    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"
    TRAIN_PLOT_PATH = f"{experiment_prefix}_training_plot.png"

    print("=" * 60)
    print(f"Running 21 Actions with {epsilon_strategy.title()} Epsilon Strategy")
    print("=" * 60)
    print()

    env = gym.make(ENV_NAME)
    agent = AdvancedDQNAgent(
        INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=epsilon_strategy
    )
    
    print("Model Summary:")
    agent.summary()
    print()
    
    scores = []
    best_avg_reward = -np.inf
    episode_times = []
    epsilon_history = []
    best_episode = 0
    
    start = time.time()

    for ep in range(1, MAX_EPISODES + 1):
        ep_start = time.time()
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]
        total_reward = 0

        for t in range(MAX_STEPS):
            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, n_actions)
            s_next, r, done, info = env.step(torque)
            s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            agent.remember(s, a_idx, r, s_next, done)
            agent.train_step()
            s = s_next
            total_reward += r
            if done:
                break

        # Advanced epsilon decay
        recent_avg = np.mean(scores[-10:]) if len(scores) >= 10 else total_reward
        agent.decay_epsilon_advanced(ep, recent_avg)
        epsilon_history.append(agent.epsilon)
        
        if ep % TARGET_UPDATE_EVERY == 0:
            agent.update_target()

        # Save checkpoints every 100 episodes
        if ep % 100 == 0:
            agent.save(f"{experiment_prefix}_{ep}_weights.h5")
        
        scores.append(total_reward)
        avg_reward = np.mean(scores[-10:])
        ep_time = time.time() - ep_start
        episode_times.append(ep_time)
        
        # Track best performance for final model saving
        if avg_reward > best_avg_reward:
            best_avg_reward = avg_reward
            best_episode = ep
            agent.save(SAVE_WEIGHTS_PATH)
        
        # Print episode info with timing (similar to your format)
        if ep <= 10 or ep % 25 == 0 or ep in [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600]:
            print(f"Episode {ep} | Total Reward: {total_reward:.2f} | Avg(10): {avg_reward:.2f} | Epsilon: {agent.epsilon:.3f} | Time: {ep_time:.2f}s")

    env.close()
    total_time = time.time() - start
    avg_time_per_episode = total_time / MAX_EPISODES

    print()
    print("TRAINING COMPLETED")
    print(f"Episodes trained: {MAX_EPISODES}")
    print(f"Best episode: {best_episode}")
    print(f"Best average reward over 10 episodes: {best_avg_reward:.2f}")
    print(f"Final epsilon: {agent.epsilon:.4f}")
    print(f"Best model weights saved to: {SAVE_WEIGHTS_PATH}")
    print(f"Total training time: {total_time:.2f}s")
    print(f"Time per episode: {avg_time_per_episode:.2f}s")
    print()

    # ROBUST EVALUATION with multiple runs
    print("Evaluating trained model...")
    eval_results = evaluate_epsilon_zero_robust_strategy(experiment_prefix, n_actions, num_episodes=20, num_runs=5)
    
    if eval_results:
        print(f"\nFINAL EVALUATION RESULTS ({epsilon_strategy.upper()}):")
        print(f"Robust evaluation (epsilon=0.0): {eval_results['overall_mean']:.2f} ± {eval_results['overall_std']:.2f}")
        print(f"95% Confidence Interval: [{eval_results['ci_lower']:.2f}, {eval_results['ci_upper']:.2f}]")
        print(f"Best episode: {max(eval_results['all_rewards']):.2f}")
        print(f"Worst episode: {min(eval_results['all_rewards']):.2f}")
        print(f"Total evaluation episodes: {eval_results['num_runs']} runs × {eval_results['num_episodes']} episodes = {eval_results['num_runs'] * eval_results['num_episodes']} episodes")
    
    return {
        'strategy': epsilon_strategy,
        'episodes_trained': MAX_EPISODES,
        'best_episode': best_episode,
        'best_training_reward': best_avg_reward,
        'eval_results': eval_results,
        'training_time': total_time,
        'time_per_episode': avg_time_per_episode,
        'final_epsilon': agent.epsilon
    }
In [23]:
def run_epsilon_exploration_experiments():
    """Run all epsilon exploration experiments with 600 episodes each"""
    
    # Set seeds for reproducibility
    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    tf.random.set_seed(SEED)
    
    strategies = [
        "linear",              # Baseline (same as your previous training)
        "performance_based",   # Adaptive based on learning progress
        "plateau_restart",     # Restart epsilon during plateaus
        "high_exploration"     # Maintain higher exploration throughout
    ]
    
    results = {}
    n_actions = 21
    
    print("EPSILON EXPLORATION STRATEGY COMPARISON")
    print("600 EPISODES EACH - NO EARLY STOPPING")
    print("=" * 80)
    print()
    
    for i, strategy in enumerate(strategies, 1):
        experiment_prefix = f"21act_epsilon_{strategy}"
        
        print(f"EXPERIMENT {i}/{len(strategies)}: {strategy.upper()} STRATEGY")
        print("-" * 60)
        
        results[strategy] = train_epsilon_exploration_experiment(
            n_actions=n_actions,
            epsilon_strategy=strategy,
            experiment_prefix=experiment_prefix
        )
        
        print(f"\n{strategy.upper()} EXPERIMENT COMPLETED")
        print("=" * 60)
        print()
    
    # Enhanced comparison analysis with robust evaluation results
    print("=" * 80)
    print("EPSILON STRATEGY COMPARISON RESULTS")
    print("=" * 80)
    print()
    
    print(f"{'Strategy':<18} {'Episodes':<9} {'Best Ep':<8} {'Train Best':<11} {'Eval Mean':<11} {'Eval CI':<18} {'Time':<10}")
    print("-" * 90)
    
    for strategy, result in results.items():
        eval_results = result['eval_results']
        if eval_results:
            eval_mean = eval_results['overall_mean']
            ci_lower = eval_results['ci_lower']
            ci_upper = eval_results['ci_upper']
            ci_str = f"[{ci_lower:.1f}, {ci_upper:.1f}]"
        else:
            eval_mean = "N/A"
            ci_str = "N/A"
            
        print(f"{strategy.title():<18} {result['episodes_trained']:<9} {result['best_episode']:<8} "
              f"{result['best_training_reward']:<11.1f} {eval_mean:<11.1f} {ci_str:<18} "
              f"{result['training_time']/60:<10.1f}min")
    
    print()
    
    # Statistical analysis
    valid_results = {k: v for k, v in results.items() if v['eval_results'] is not None}
    
    if len(valid_results) > 1:
        print("STATISTICAL ANALYSIS:")
        print("-" * 40)
        
        # Find best strategy
        best_mean = max(valid_results.keys(), key=lambda x: valid_results[x]['eval_results']['overall_mean'])
        best_stability = min(valid_results.keys(), key=lambda x: valid_results[x]['eval_results']['overall_std'])
        
        print(f"Best Performance: {best_mean.upper()} ({valid_results[best_mean]['eval_results']['overall_mean']:.1f})")
        print(f"Best Stability: {best_stability.upper()} (±{valid_results[best_stability]['eval_results']['overall_std']:.1f})")
        print()
        
        # Check for statistical significance
        strategies_list = list(valid_results.keys())
        print("STATISTICAL SIGNIFICANCE (Confidence Interval Analysis):")
        print("-" * 50)
        
        baseline_key = "linear"  # Compare all to linear baseline
        if baseline_key in valid_results:
            baseline = valid_results[baseline_key]['eval_results']
            print(f"Comparing all strategies to {baseline_key.upper()} baseline:")
            print()
            
            for strategy in strategies_list:
                if strategy != baseline_key:
                    result = valid_results[strategy]['eval_results']
                    
                    # Check CI overlap
                    overlap = not (baseline['ci_upper'] < result['ci_lower'] or result['ci_upper'] < baseline['ci_lower'])
                    significance = "NOT significant" if overlap else "SIGNIFICANT"
                    
                    improvement = result['overall_mean'] - baseline['overall_mean']
                    print(f"{strategy.upper():<18} vs {baseline_key.upper()}: {improvement:+6.1f} ({significance})")
        
        print()
    
    # Save results
    with open("epsilon_strategy_comparison.json", "w") as f:
        # Convert numpy types to native Python types for JSON serialization
        json_results = {}
        for strategy, result in results.items():
            json_result = result.copy()
            if json_result['eval_results']:
                eval_results = json_result['eval_results'].copy()
                for key, value in eval_results.items():
                    if isinstance(value, np.ndarray):
                        eval_results[key] = value.tolist()
                    elif isinstance(value, (np.float64, np.float32)):
                        eval_results[key] = float(value)
                    elif isinstance(value, (np.int64, np.int32)):
                        eval_results[key] = int(value)
                json_result['eval_results'] = eval_results
            json_results[strategy] = json_result
            
        json.dump(json_results, f, indent=2)
    
    print(f"Results saved to 'epsilon_strategy_comparison.json'")
    print(f"Training plots saved for each strategy")
    return results
In [11]:
if __name__ == "__main__":
    # Run all epsilon exploration experiments
    epsilon_results = run_epsilon_exploration_experiments()
EPSILON EXPLORATION STRATEGY COMPARISON
600 EPISODES EACH - NO EARLY STOPPING
================================================================================

EXPERIMENT 1/4: LINEAR STRATEGY
------------------------------------------------------------
============================================================
Running 21 Actions with Linear Epsilon Strategy
============================================================

Model Summary:

Model Summary:
Model: "dqn"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               multiple                  256       
                                                                 
 dense_1 (Dense)             multiple                  4160      
                                                                 
 dense_2 (Dense)             multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Total Reward: -959.48 | Avg(10): -959.48 | Epsilon: 0.995 | Time: 0.02s
Episode 2 | Total Reward: -1696.36 | Avg(10): -1327.92 | Epsilon: 0.990 | Time: 0.03s
Episode 3 | Total Reward: -1498.99 | Avg(10): -1384.94 | Epsilon: 0.985 | Time: 0.02s
Episode 4 | Total Reward: -1410.85 | Avg(10): -1391.42 | Epsilon: 0.980 | Time: 0.03s
Episode 5 | Total Reward: -1706.43 | Avg(10): -1454.42 | Epsilon: 0.975 | Time: 0.13s
Episode 6 | Total Reward: -1315.62 | Avg(10): -1431.29 | Epsilon: 0.970 | Time: 5.15s
Episode 7 | Total Reward: -987.16 | Avg(10): -1367.84 | Epsilon: 0.966 | Time: 4.98s
Episode 8 | Total Reward: -1628.71 | Avg(10): -1400.45 | Epsilon: 0.961 | Time: 6.33s
Episode 9 | Total Reward: -1324.82 | Avg(10): -1392.05 | Epsilon: 0.956 | Time: 5.04s
Episode 10 | Total Reward: -1209.99 | Avg(10): -1373.84 | Epsilon: 0.951 | Time: 5.88s
Episode 25 | Total Reward: -1319.37 | Avg(10): -1219.60 | Epsilon: 0.882 | Time: 5.15s
Episode 50 | Total Reward: -1507.41 | Avg(10): -1264.74 | Epsilon: 0.778 | Time: 8.17s
Episode 75 | Total Reward: -972.53 | Avg(10): -1058.63 | Epsilon: 0.687 | Time: 8.59s
Episode 100 | Total Reward: -1017.70 | Avg(10): -1066.07 | Epsilon: 0.606 | Time: 7.83s
Episode 125 | Total Reward: -870.80 | Avg(10): -896.78 | Epsilon: 0.534 | Time: 8.87s
Episode 150 | Total Reward: -878.07 | Avg(10): -652.48 | Epsilon: 0.471 | Time: 38.12s
Episode 175 | Total Reward: -371.50 | Avg(10): -552.39 | Epsilon: 0.416 | Time: 13.62s
Episode 200 | Total Reward: -383.11 | Avg(10): -324.97 | Epsilon: 0.367 | Time: 17.46s
Episode 225 | Total Reward: -466.67 | Avg(10): -349.12 | Epsilon: 0.324 | Time: 7.83s
Episode 250 | Total Reward: -125.33 | Avg(10): -235.23 | Epsilon: 0.286 | Time: 9.42s
Episode 275 | Total Reward: -348.76 | Avg(10): -266.54 | Epsilon: 0.252 | Time: 8.00s
Episode 300 | Total Reward: -2.25 | Avg(10): -210.16 | Epsilon: 0.222 | Time: 10.57s
Episode 325 | Total Reward: -124.51 | Avg(10): -261.15 | Epsilon: 0.196 | Time: 12.05s
Episode 350 | Total Reward: -121.68 | Avg(10): -201.24 | Epsilon: 0.173 | Time: 6.52s
Episode 375 | Total Reward: -232.71 | Avg(10): -185.18 | Epsilon: 0.153 | Time: 7.46s
Episode 400 | Total Reward: -235.55 | Avg(10): -195.23 | Epsilon: 0.135 | Time: 7.35s
Episode 425 | Total Reward: -124.52 | Avg(10): -190.17 | Epsilon: 0.119 | Time: 9.37s
Episode 450 | Total Reward: -126.36 | Avg(10): -249.92 | Epsilon: 0.105 | Time: 7.82s
Episode 475 | Total Reward: -1.52 | Avg(10): -121.76 | Epsilon: 0.092 | Time: 8.79s
Episode 500 | Total Reward: -126.40 | Avg(10): -185.79 | Epsilon: 0.082 | Time: 15.38s
Episode 525 | Total Reward: -121.91 | Avg(10): -204.39 | Epsilon: 0.072 | Time: 9.56s
Episode 550 | Total Reward: -2.03 | Avg(10): -141.27 | Epsilon: 0.063 | Time: 10.92s
Episode 575 | Total Reward: -351.84 | Avg(10): -143.39 | Epsilon: 0.056 | Time: 16.05s
Episode 600 | Total Reward: -122.60 | Avg(10): -108.64 | Epsilon: 0.050 | Time: 16.83s

TRAINING COMPLETED
Episodes trained: 600
Best episode: 389
Best average reward over 10 episodes: -72.81
Final epsilon: 0.0500
Best model weights saved to: 21act_epsilon_linear_weights.h5
Total training time: 6083.40s
Time per episode: 10.14s

Evaluating trained model...

Robust Evaluation: 21act_epsilon_linear with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -214.4 ± 396.1
--- Run 2/5 ---
Run 2: -152.6 ± 75.7
--- Run 3/5 ---
Run 3: -86.1 ± 67.6
--- Run 4/5 ---
Run 4: -122.8 ± 85.7
--- Run 5/5 ---
Run 5: -317.4 ± 529.1

--- ROBUST EVALUATION SUMMARY ---
Overall mean: -178.68
Run-to-run std: 81.11
95% CI: [-279.39, -77.97]
--------------------------------------------------

FINAL EVALUATION RESULTS (LINEAR):
Robust evaluation (epsilon=0.0): -178.68 ± 81.11
95% Confidence Interval: [-279.39, -77.97]
Best episode: -0.69
Worst episode: -1902.69
Total evaluation episodes: 5 runs × 20 episodes = 100 episodes

LINEAR EXPERIMENT COMPLETED
============================================================

EXPERIMENT 2/4: PERFORMANCE_BASED STRATEGY
------------------------------------------------------------
============================================================
Running 21 Actions with Performance_Based Epsilon Strategy
============================================================

Model Summary:

Model Summary:
Model: "dqn_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_12 (Dense)            multiple                  256       
                                                                 
 dense_13 (Dense)            multiple                  4160      
                                                                 
 dense_14 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Total Reward: -861.25 | Avg(10): -861.25 | Epsilon: 0.995 | Time: 0.03s
Episode 2 | Total Reward: -916.06 | Avg(10): -888.65 | Epsilon: 0.990 | Time: 0.14s
Episode 3 | Total Reward: -1351.07 | Avg(10): -1042.79 | Epsilon: 0.985 | Time: 0.12s
Episode 4 | Total Reward: -1747.14 | Avg(10): -1218.88 | Epsilon: 0.980 | Time: 0.04s
Episode 5 | Total Reward: -968.47 | Avg(10): -1168.80 | Epsilon: 0.975 | Time: 0.39s
Episode 6 | Total Reward: -1518.34 | Avg(10): -1227.05 | Epsilon: 0.970 | Time: 10.83s
Episode 7 | Total Reward: -1659.64 | Avg(10): -1288.85 | Epsilon: 0.966 | Time: 9.99s
Episode 8 | Total Reward: -1063.90 | Avg(10): -1260.73 | Epsilon: 0.961 | Time: 14.88s
Episode 9 | Total Reward: -1346.26 | Avg(10): -1270.24 | Epsilon: 0.956 | Time: 7.49s
Episode 10 | Total Reward: -1073.01 | Avg(10): -1250.51 | Epsilon: 0.951 | Time: 7.39s
Episode 25 | Total Reward: -950.21 | Avg(10): -1190.02 | Epsilon: 0.893 | Time: 7.80s
Episode 50 | Total Reward: -1180.87 | Avg(10): -1234.88 | Epsilon: 0.785 | Time: 8.09s
Episode 75 | Total Reward: -1101.20 | Avg(10): -1073.11 | Epsilon: 0.747 | Time: 8.55s
Episode 100 | Total Reward: -1205.76 | Avg(10): -1135.61 | Epsilon: 0.618 | Time: 13.31s
Episode 125 | Total Reward: -1022.55 | Avg(10): -1061.47 | Epsilon: 0.528 | Time: 17.23s
Episode 150 | Total Reward: -764.15 | Avg(10): -658.80 | Epsilon: 0.502 | Time: 12.85s
Episode 175 | Total Reward: -809.67 | Avg(10): -641.90 | Epsilon: 0.447 | Time: 13.62s
Episode 200 | Total Reward: -252.51 | Avg(10): -529.01 | Epsilon: 0.415 | Time: 8.08s
Episode 225 | Total Reward: -128.19 | Avg(10): -335.44 | Epsilon: 0.395 | Time: 6.42s
Episode 250 | Total Reward: -609.10 | Avg(10): -316.55 | Epsilon: 0.351 | Time: 5.75s
Episode 275 | Total Reward: -246.80 | Avg(10): -228.94 | Epsilon: 0.322 | Time: 6.38s
Episode 300 | Total Reward: -247.49 | Avg(10): -210.48 | Epsilon: 0.290 | Time: 6.07s
Episode 325 | Total Reward: -247.81 | Avg(10): -160.28 | Epsilon: 0.276 | Time: 6.21s
Episode 350 | Total Reward: -1.56 | Avg(10): -239.92 | Epsilon: 0.241 | Time: 6.65s
Episode 375 | Total Reward: -250.47 | Avg(10): -302.11 | Epsilon: 0.215 | Time: 6.02s
Episode 400 | Total Reward: -368.77 | Avg(10): -207.43 | Epsilon: 0.190 | Time: 6.10s
Episode 425 | Total Reward: -116.80 | Avg(10): -166.16 | Epsilon: 0.159 | Time: 6.61s
Episode 450 | Total Reward: -238.07 | Avg(10): -169.07 | Epsilon: 0.141 | Time: 6.19s
Episode 475 | Total Reward: -118.43 | Avg(10): -138.82 | Epsilon: 0.120 | Time: 6.42s
Episode 500 | Total Reward: -125.03 | Avg(10): -122.75 | Epsilon: 0.103 | Time: 6.97s
Episode 525 | Total Reward: -124.37 | Avg(10): -196.23 | Epsilon: 0.091 | Time: 6.51s
Episode 550 | Total Reward: -125.96 | Avg(10): -145.23 | Epsilon: 0.085 | Time: 6.23s
Episode 575 | Total Reward: -126.08 | Avg(10): -145.02 | Epsilon: 0.072 | Time: 7.58s
Episode 600 | Total Reward: -1.73 | Avg(10): -183.45 | Epsilon: 0.063 | Time: 7.08s

TRAINING COMPLETED
Episodes trained: 600
Best episode: 505
Best average reward over 10 episodes: -110.35
Final epsilon: 0.0633
Best model weights saved to: 21act_epsilon_performance_based_weights.h5
Total training time: 4618.71s
Time per episode: 7.70s

Evaluating trained model...

Robust Evaluation: 21act_epsilon_performance_based with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -146.4 ± 109.5
--- Run 2/5 ---
Run 2: -170.5 ± 110.2
--- Run 3/5 ---
Run 3: -164.8 ± 135.2
--- Run 4/5 ---
Run 4: -145.3 ± 104.0
--- Run 5/5 ---
Run 5: -230.2 ± 363.8

--- ROBUST EVALUATION SUMMARY ---
Overall mean: -171.43
Run-to-run std: 31.02
95% CI: [-209.94, -132.91]
--------------------------------------------------

FINAL EVALUATION RESULTS (PERFORMANCE_BASED):
Robust evaluation (epsilon=0.0): -171.43 ± 31.02
95% Confidence Interval: [-209.94, -132.91]
Best episode: -0.09
Worst episode: -1777.96
Total evaluation episodes: 5 runs × 20 episodes = 100 episodes

PERFORMANCE_BASED EXPERIMENT COMPLETED
============================================================

EXPERIMENT 3/4: PLATEAU_RESTART STRATEGY
------------------------------------------------------------
============================================================
Running 21 Actions with Plateau_Restart Epsilon Strategy
============================================================

Model Summary:

Model Summary:
Model: "dqn_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_24 (Dense)            multiple                  256       
                                                                 
 dense_25 (Dense)            multiple                  4160      
                                                                 
 dense_26 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Total Reward: -1213.38 | Avg(10): -1213.38 | Epsilon: 0.995 | Time: 0.05s
Episode 2 | Total Reward: -1059.30 | Avg(10): -1136.34 | Epsilon: 0.990 | Time: 0.03s
Episode 3 | Total Reward: -1068.63 | Avg(10): -1113.77 | Epsilon: 0.985 | Time: 0.09s
Episode 4 | Total Reward: -1422.62 | Avg(10): -1190.98 | Epsilon: 0.980 | Time: 0.07s
Episode 5 | Total Reward: -1482.36 | Avg(10): -1249.26 | Epsilon: 0.975 | Time: 0.32s
Episode 6 | Total Reward: -896.97 | Avg(10): -1190.54 | Epsilon: 0.970 | Time: 12.55s
Episode 7 | Total Reward: -1148.31 | Avg(10): -1184.51 | Epsilon: 0.966 | Time: 9.28s
Episode 8 | Total Reward: -818.93 | Avg(10): -1138.81 | Epsilon: 0.961 | Time: 5.99s
Episode 9 | Total Reward: -1281.67 | Avg(10): -1154.69 | Epsilon: 0.956 | Time: 6.01s
Episode 10 | Total Reward: -1065.23 | Avg(10): -1145.74 | Epsilon: 0.951 | Time: 6.02s
Epsilon restart at episode 20: 0.909 → 0.300
Episode 25 | Total Reward: -1207.44 | Avg(10): -1283.26 | Epsilon: 0.293 | Time: 5.41s
Epsilon restart at episode 40: 0.273 → 0.300
Episode 50 | Total Reward: -894.37 | Avg(10): -1039.20 | Epsilon: 0.285 | Time: 5.52s
Episode 75 | Total Reward: -1329.07 | Avg(10): -1091.85 | Epsilon: 0.252 | Time: 5.72s
Episode 100 | Total Reward: -1035.07 | Avg(10): -1084.60 | Epsilon: 0.222 | Time: 5.61s
Episode 125 | Total Reward: -130.65 | Avg(10): -341.70 | Epsilon: 0.196 | Time: 5.28s
Episode 150 | Total Reward: -126.61 | Avg(10): -294.86 | Epsilon: 0.173 | Time: 5.33s
Episode 175 | Total Reward: -249.80 | Avg(10): -386.45 | Epsilon: 0.152 | Time: 5.48s
Episode 200 | Total Reward: -495.19 | Avg(10): -188.25 | Epsilon: 0.135 | Time: 5.84s
Episode 225 | Total Reward: -126.06 | Avg(10): -195.67 | Epsilon: 0.119 | Time: 5.48s
Episode 250 | Total Reward: -243.38 | Avg(10): -341.62 | Epsilon: 0.105 | Time: 5.28s
Epsilon restart at episode 254: 0.103 → 0.300
Episode 275 | Total Reward: -361.82 | Avg(10): -354.72 | Epsilon: 0.270 | Time: 5.87s
Episode 300 | Total Reward: -239.16 | Avg(10): -287.99 | Epsilon: 0.238 | Time: 8.07s
Episode 325 | Total Reward: -234.43 | Avg(10): -206.27 | Epsilon: 0.210 | Time: 7.19s
Episode 350 | Total Reward: -124.36 | Avg(10): -161.35 | Epsilon: 0.185 | Time: 9.34s
Epsilon restart at episode 351: 0.185 → 0.300
Episode 375 | Total Reward: -241.92 | Avg(10): -263.46 | Epsilon: 0.266 | Time: 8.84s
Epsilon restart at episode 383: 0.257 → 0.300
Episode 400 | Total Reward: -359.15 | Avg(10): -231.80 | Epsilon: 0.275 | Time: 6.99s
Episode 425 | Total Reward: -241.47 | Avg(10): -169.54 | Epsilon: 0.243 | Time: 7.14s
Episode 450 | Total Reward: -237.52 | Avg(10): -111.27 | Epsilon: 0.214 | Time: 7.98s
Episode 475 | Total Reward: -235.74 | Avg(10): -254.66 | Epsilon: 0.189 | Time: 7.56s
Epsilon restart at episode 478: 0.187 → 0.300
Episode 500 | Total Reward: -126.51 | Avg(10): -242.75 | Epsilon: 0.269 | Time: 7.34s
Episode 525 | Total Reward: -122.50 | Avg(10): -197.34 | Epsilon: 0.237 | Time: 6.80s
Episode 550 | Total Reward: -127.05 | Avg(10): -148.35 | Epsilon: 0.209 | Time: 8.71s
Episode 575 | Total Reward: -397.63 | Avg(10): -194.13 | Epsilon: 0.184 | Time: 7.42s
Episode 600 | Total Reward: -1.52 | Avg(10): -122.44 | Epsilon: 0.163 | Time: 7.31s

TRAINING COMPLETED
Episodes trained: 600
Best episode: 446
Best average reward over 10 episodes: -99.09
Final epsilon: 0.1628
Best model weights saved to: 21act_epsilon_plateau_restart_weights.h5
Total training time: 4001.10s
Time per episode: 6.67s

Evaluating trained model...

Robust Evaluation: 21act_epsilon_plateau_restart with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -166.7 ± 91.3
--- Run 2/5 ---
Run 2: -154.6 ± 98.6
--- Run 3/5 ---
Run 3: -160.4 ± 75.4
--- Run 4/5 ---
Run 4: -165.9 ± 116.5
--- Run 5/5 ---
Run 5: -129.0 ± 99.3

--- ROBUST EVALUATION SUMMARY ---
Overall mean: -155.32
Run-to-run std: 13.86
95% CI: [-172.53, -138.11]
--------------------------------------------------

FINAL EVALUATION RESULTS (PLATEAU_RESTART):
Robust evaluation (epsilon=0.0): -155.32 ± 13.86
95% Confidence Interval: [-172.53, -138.11]
Best episode: -0.72
Worst episode: -366.41
Total evaluation episodes: 5 runs × 20 episodes = 100 episodes

PLATEAU_RESTART EXPERIMENT COMPLETED
============================================================

EXPERIMENT 4/4: HIGH_EXPLORATION STRATEGY
------------------------------------------------------------
============================================================
Running 21 Actions with High_Exploration Epsilon Strategy
============================================================

Model Summary:

Model Summary:
Model: "dqn_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_36 (Dense)            multiple                  256       
                                                                 
 dense_37 (Dense)            multiple                  4160      
                                                                 
 dense_38 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Total Reward: -1431.53 | Avg(10): -1431.53 | Epsilon: 1.000 | Time: 0.02s
Episode 2 | Total Reward: -1001.00 | Avg(10): -1216.26 | Epsilon: 0.999 | Time: 0.02s
Episode 3 | Total Reward: -1688.64 | Avg(10): -1373.72 | Epsilon: 0.999 | Time: 0.03s
Episode 4 | Total Reward: -1327.31 | Avg(10): -1362.12 | Epsilon: 0.998 | Time: 0.02s
Episode 5 | Total Reward: -1555.70 | Avg(10): -1400.84 | Epsilon: 0.998 | Time: 0.11s
Episode 6 | Total Reward: -1497.93 | Avg(10): -1417.02 | Epsilon: 0.997 | Time: 7.26s
Episode 7 | Total Reward: -1073.38 | Avg(10): -1367.93 | Epsilon: 0.997 | Time: 7.72s
Episode 8 | Total Reward: -982.80 | Avg(10): -1319.79 | Epsilon: 0.996 | Time: 7.93s
Episode 9 | Total Reward: -980.57 | Avg(10): -1282.10 | Epsilon: 0.996 | Time: 7.99s
Episode 10 | Total Reward: -962.78 | Avg(10): -1250.16 | Epsilon: 0.995 | Time: 6.39s
Episode 25 | Total Reward: -904.82 | Avg(10): -1327.66 | Epsilon: 0.988 | Time: 6.68s
Episode 50 | Total Reward: -1077.35 | Avg(10): -1214.58 | Epsilon: 0.975 | Time: 6.80s
Episode 75 | Total Reward: -1642.35 | Avg(10): -1063.97 | Epsilon: 0.963 | Time: 6.05s
Episode 100 | Total Reward: -1485.59 | Avg(10): -1297.70 | Epsilon: 0.951 | Time: 6.68s
Episode 125 | Total Reward: -1330.21 | Avg(10): -1202.71 | Epsilon: 0.939 | Time: 6.35s
Episode 150 | Total Reward: -1263.11 | Avg(10): -1169.75 | Epsilon: 0.928 | Time: 6.34s
Episode 175 | Total Reward: -1060.40 | Avg(10): -1140.92 | Epsilon: 0.916 | Time: 7.48s
Episode 200 | Total Reward: -1287.33 | Avg(10): -1074.82 | Epsilon: 0.905 | Time: 7.25s
Episode 225 | Total Reward: -1017.85 | Avg(10): -1142.25 | Epsilon: 0.894 | Time: 5.80s
Episode 250 | Total Reward: -1026.77 | Avg(10): -982.50 | Epsilon: 0.882 | Time: 7.02s
Episode 275 | Total Reward: -1382.40 | Avg(10): -1163.35 | Epsilon: 0.872 | Time: 7.01s
Episode 300 | Total Reward: -1037.83 | Avg(10): -1103.23 | Epsilon: 0.861 | Time: 5.75s
Episode 325 | Total Reward: -1082.97 | Avg(10): -1027.85 | Epsilon: 0.850 | Time: 6.49s
Episode 350 | Total Reward: -1228.09 | Avg(10): -981.64 | Epsilon: 0.839 | Time: 7.11s
Episode 375 | Total Reward: -776.29 | Avg(10): -873.65 | Epsilon: 0.829 | Time: 6.90s
Episode 400 | Total Reward: -741.54 | Avg(10): -902.45 | Epsilon: 0.819 | Time: 7.36s
Episode 425 | Total Reward: -1433.21 | Avg(10): -1020.76 | Epsilon: 0.809 | Time: 6.62s
Episode 450 | Total Reward: -899.42 | Avg(10): -928.40 | Epsilon: 0.798 | Time: 6.53s
Episode 475 | Total Reward: -844.30 | Avg(10): -938.61 | Epsilon: 0.789 | Time: 8.13s
Episode 500 | Total Reward: -986.75 | Avg(10): -913.92 | Epsilon: 0.779 | Time: 6.91s
Episode 525 | Total Reward: -837.56 | Avg(10): -838.73 | Epsilon: 0.769 | Time: 6.47s
Episode 550 | Total Reward: -775.58 | Avg(10): -729.59 | Epsilon: 0.760 | Time: 5.47s
Episode 575 | Total Reward: -629.62 | Avg(10): -946.20 | Epsilon: 0.750 | Time: 5.90s
Episode 600 | Total Reward: -736.20 | Avg(10): -824.35 | Epsilon: 0.741 | Time: 6.82s

TRAINING COMPLETED
Episodes trained: 600
Best episode: 549
Best average reward over 10 episodes: -727.77
Final epsilon: 0.7408
Best model weights saved to: 21act_epsilon_high_exploration_weights.h5
Total training time: 3971.47s
Time per episode: 6.62s

Evaluating trained model...

Robust Evaluation: 21act_epsilon_high_exploration with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -156.5 ± 94.8
--- Run 2/5 ---
Run 2: -181.1 ± 123.6
--- Run 3/5 ---
Run 3: -155.4 ± 75.5
--- Run 4/5 ---
Run 4: -148.1 ± 85.6
--- Run 5/5 ---
Run 5: -178.1 ± 82.2

--- ROBUST EVALUATION SUMMARY ---
Overall mean: -163.84
Run-to-run std: 13.22
95% CI: [-180.26, -147.42]
--------------------------------------------------

FINAL EVALUATION RESULTS (HIGH_EXPLORATION):
Robust evaluation (epsilon=0.0): -163.84 ± 13.22
95% Confidence Interval: [-180.26, -147.42]
Best episode: -0.14
Worst episode: -512.14
Total evaluation episodes: 5 runs × 20 episodes = 100 episodes

HIGH_EXPLORATION EXPERIMENT COMPLETED
============================================================

================================================================================
EPSILON STRATEGY COMPARISON RESULTS
================================================================================

Strategy           Episodes  Best Ep  Train Best  Eval Mean   Eval CI            Time      
------------------------------------------------------------------------------------------
Linear             600       389      -72.8       -178.7      [-279.4, -78.0]    101.4     min
Performance_Based  600       505      -110.4      -171.4      [-209.9, -132.9]   77.0      min
Plateau_Restart    600       446      -99.1       -155.3      [-172.5, -138.1]   66.7      min
High_Exploration   600       549      -727.8      -163.8      [-180.3, -147.4]   66.2      min

STATISTICAL ANALYSIS:
----------------------------------------
Best Performance: PLATEAU_RESTART (-155.3)
Best Stability: HIGH_EXPLORATION (±13.2)

STATISTICAL SIGNIFICANCE (Confidence Interval Analysis):
--------------------------------------------------
Comparing all strategies to LINEAR baseline:

PERFORMANCE_BASED  vs LINEAR:   +7.3 (NOT significant)
PLATEAU_RESTART    vs LINEAR:  +23.4 (NOT significant)
HIGH_EXPLORATION   vs LINEAR:  +14.8 (NOT significant)

Results saved to 'epsilon_strategy_comparison.json'
Training plots saved for each strategy

Observations

  1. Performance Rankings (Best to Worst):
  • Plateau Restart: -155.3 (best mean performance)
  • High Exploration: -163.8 (best stability ±13.2)
  • Performance Based: -171.4
  • Linear (Baseline): -178.7 (worst performance)
  1. Statistical Significance:
  • ALL improvements are NOT statistically significant - confidence intervals overlap
  • However, Plateau Restart shows +23.4 improvement - largest practical difference
  • High Exploration has the tightest confidence interval (most consistent)

Key Insights:

  • Plateau Restart: Best performance + 35% faster training
  • High Exploration: Most stable results but found worse training optimum
  • Linear baseline: Slowest and worst performing

Epsilon strategy to use ¶

PLATEAU-RESTART Plateau-Restart Advantages:

  • Best evaluation performance: -155.3 vs -178.7 baseline (+23.4 improvement)
  • Training efficiency: 66.7 min vs 101.4 min (35% time savings)
  • Practical significance: Even if not statistically significant, +23 reward points is meaningful
  • Theoretical soundness: Escapes local minima by restarting exploration

Replay Memory Exploration¶

Replay Memory (also called Experience Replay) is a crucial component of DQN that stores and reuses past experiences to improve learning efficiency and stability.

How replay memory affects training and rewards

  1. Learning Stability
  • Without replay memory: Agent learns from consecutive, correlated experiences
  • With replay memory: Agent learns from random, diverse experiences
  • Impact: More stable Q-value updates, reduced overfitting
  1. Sample Efficiency
  • Reuse experiences: Each experience can be used multiple times
  • Better data utilization: Don't waste valuable experiences
  • Impact: Faster convergence, better final performance
  1. Breaking Correlation
  • Problem: Sequential experiences are highly correlated
  • Solution: Random sampling breaks temporal correlations
  • Impact: Prevents catastrophic forgetting, improves generalization
  1. Memory Size Trade-offs:
  • Too small: Limited diversity, recent bias, poor performance
  • Too large: Slow updates, memory issues, outdated experiences
  • Optimal size: Balance between diversity and relevance
In [38]:
def train_replay_memory_experiment(n_actions, replay_config, experiment_prefix):
    """Train with different replay memory configurations"""
    
    ENV_NAME = 'Pendulum-v0'
    INPUT_SHAPE = 3
    GAMMA = 0.99
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    
    # Use your optimized settings
    MAX_EPISODES = 600
    MAX_STEPS = 200
    EPSILON_STRATEGY = "plateau_restart"  # Your best strategy

    # Replay memory configuration
    REPLAY_MEMORY_SIZE = replay_config["memory_size"]
    MIN_REPLAY_MEMORY = replay_config["min_memory"]

    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"

    print("=" * 70)
    print(f"Replay Memory Experiment: {replay_config['name'].upper()}")
    print(f"Memory Size: {REPLAY_MEMORY_SIZE:,} | Min Memory: {MIN_REPLAY_MEMORY:,}")
    print("=" * 70)
    print()

    env = gym.make(ENV_NAME)
    agent = AdvancedDQNAgent(
        INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    print("Model Summary:")
    agent.summary()
    print()
    
    scores = []
    best_avg_reward = -np.inf
    episode_times = []
    epsilon_history = []
    memory_usage = []  # Track memory utilization
    training_steps = 0  # Track total training steps
    best_episode = 0
    
    start = time.time()

    for ep in range(1, MAX_EPISODES + 1):
        ep_start = time.time()
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]
        total_reward = 0
        episode_training_steps = 0

        for t in range(MAX_STEPS):
            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, n_actions)
            torque_array = np.array([torque])
            s_next, r, done, info = env.step(torque_array)
            s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            
            agent.remember(s, a_idx, r, s_next, done)
            
            # FIXED: Use agent.memory (the deque) to check length
            memory_length = len(agent.memory)
            
            # Train only if we have enough experiences
            if memory_length >= MIN_REPLAY_MEMORY:
                agent.train_step()
                training_steps += 1
                episode_training_steps += 1
            
            s = s_next
            total_reward += r
            if done:
                break

        # Track memory utilization
        memory_usage.append(len(agent.memory))

        # Advanced epsilon decay
        recent_avg = np.mean(scores[-10:]) if len(scores) >= 10 else total_reward
        agent.decay_epsilon_advanced(ep, recent_avg)
        epsilon_history.append(agent.epsilon)
        
        if ep % TARGET_UPDATE_EVERY == 0:
            agent.update_target()

        # Save checkpoints every 100 episodes
        if ep % 100 == 0:
            agent.save(f"{experiment_prefix}_{ep}_weights.h5")
        
        scores.append(total_reward)
        avg_reward = np.mean(scores[-10:])
        ep_time = time.time() - ep_start
        episode_times.append(ep_time)
        
        # Track best performance for final model saving
        if avg_reward > best_avg_reward:
            best_avg_reward = avg_reward
            best_episode = ep
            agent.save(SAVE_WEIGHTS_PATH)
        
        # Print episode info with memory stats
        if ep <= 10 or ep % 25 == 0 or ep in [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600]:
            memory_pct = (len(agent.memory) / REPLAY_MEMORY_SIZE) * 100
            print(f"Episode {ep} | Reward: {total_reward:.2f} | Avg(10): {avg_reward:.2f} | "
                  f"ε: {agent.epsilon:.3f} | Memory: {len(agent.memory):,} ({memory_pct:.1f}%) | "
                  f"Steps: {episode_training_steps} | Time: {ep_time:.2f}s")

    env.close()
    total_time = time.time() - start
    avg_time_per_episode = total_time / MAX_EPISODES

    print()
    print("TRAINING COMPLETED")
    print(f"Episodes trained: {MAX_EPISODES}")
    print(f"Best episode: {best_episode}")
    print(f"Best average reward over 10 episodes: {best_avg_reward:.2f}")
    print(f"Final epsilon: {agent.epsilon:.4f}")
    print(f"Total training steps: {training_steps:,}")
    print(f"Final memory size: {len(agent.memory):,}/{REPLAY_MEMORY_SIZE:,}")
    print(f"Best model weights saved to: {SAVE_WEIGHTS_PATH}")
    print(f"Total training time: {total_time:.2f}s")
    print(f"Time per episode: {avg_time_per_episode:.2f}s")
    print()

    # ROBUST EVALUATION
    print("Evaluating trained model...")
    eval_results = evaluate_replay_memory_robust(experiment_prefix, n_actions, num_episodes=20, num_runs=5)
    
    return {
        'config_name': replay_config['name'],
        'memory_size': REPLAY_MEMORY_SIZE,
        'min_memory': MIN_REPLAY_MEMORY,
        'episodes_trained': MAX_EPISODES,
        'best_episode': best_episode,
        'best_training_reward': best_avg_reward,
        'eval_results': eval_results,
        'training_time': total_time,
        'time_per_episode': avg_time_per_episode,
        'total_training_steps': training_steps,
        'final_memory_usage': len(agent.memory),
        'memory_usage_history': memory_usage,
        'scores_history': scores
    }
In [39]:
def evaluate_replay_memory_robust(experiment_prefix, n_actions, num_episodes=20, num_runs=5):
    """Robust evaluation for replay memory experiments"""
    
    INPUT_SHAPE = 3
    GAMMA = 0.99
    REPLAY_MEMORY_SIZE = 50000  # Use standard for evaluation
    MIN_REPLAY_MEMORY = 1000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    MAX_STEPS = 200
    
    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"
    
    # Recreate agent (use base DQNAgent for evaluation)
    agent = DQNAgent(INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, 
                    MIN_REPLAY_MEMORY, BATCH_SIZE, TARGET_UPDATE_EVERY, 
                    LEARNING_RATE, EPSILON_START, EPSILON_MIN, EPSILON_DECAY)
    
    try:
        agent.load(SAVE_WEIGHTS_PATH)
        agent.epsilon = 0.0  # Force pure exploitation
    except FileNotFoundError:
        print(f"Warning: Weights file {SAVE_WEIGHTS_PATH} not found")
        return None
    
    print(f"\nRobust Evaluation: {experiment_prefix} with epsilon=0.0")
    print(f"Running {num_runs} evaluation sessions of {num_episodes} episodes each")
    
    all_run_results = []
    
    for run in range(num_runs):
        print(f"--- Run {run+1}/{num_runs} ---")
        env = gym.make('Pendulum-v0')
        
        run_rewards = []
        
        for ep in range(num_episodes):
            s = env.reset()
            s = s if isinstance(s, np.ndarray) else s[0]
            total_reward = 0
            
            for t in range(MAX_STEPS):
                a_idx = agent.select_action(s)
                torque = action_index_to_torque(a_idx, n_actions)
                torque_array = np.array([torque])
                s_next, r, done, info = env.step(torque_array)
                s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
                total_reward += r
                s = s_next
                if done:
                    break
            
            run_rewards.append(total_reward)
        
        env.close()
        
        run_mean = np.mean(run_rewards)
        run_std = np.std(run_rewards)
        all_run_results.append({
            'mean': run_mean,
            'std': run_std,
            'rewards': run_rewards
        })
        
        print(f"Run {run+1}: {run_mean:.1f} ± {run_std:.1f}")
    
    # Overall statistics
    all_means = [run['mean'] for run in all_run_results]
    overall_mean = np.mean(all_means)
    overall_std = np.std(all_means)
    
    # All individual episode rewards
    all_rewards = []
    for run in all_run_results:
        all_rewards.extend(run['rewards'])
    
    # Confidence interval
    confidence_level = 0.95
    dof = len(all_means) - 1
    t_critical = stats.t.ppf((1 + confidence_level) / 2, dof)
    margin_of_error = t_critical * (overall_std / np.sqrt(len(all_means)))
    ci_lower = overall_mean - margin_of_error
    ci_upper = overall_mean + margin_of_error
    
    print(f"\n--- ROBUST EVALUATION SUMMARY ---")
    print(f"Overall mean: {overall_mean:.2f}")
    print(f"Run-to-run std: {overall_std:.2f}")
    print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
    print("-" * 50)
    
    return {
        'overall_mean': overall_mean,
        'overall_std': overall_std,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'run_means': all_means,
        'all_rewards': all_rewards,
        'num_runs': num_runs,
        'num_episodes': num_episodes
    }
In [10]:
def action_index_to_torque(action_index, n_actions):
    """Convert action index to torque value"""
    return -2.0 + (action_index * 4.0) / (n_actions - 1)

def run_replay_memory_exploration():
    """Run comprehensive replay memory exploration"""
    
    # Set seeds for reproducibility
    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    tf.random.set_seed(SEED)
    
    # Replay memory configurations to test
    memory_configs = [
        {
            "name": "small",
            "memory_size": 10000,
            "min_memory": 500,
            "description": "Small memory - fast but limited diversity"
        },
        {
            "name": "medium_low_min",
            "memory_size": 25000,
            "min_memory": 1000,
            "description": "Medium memory with standard min threshold"
        },
        {
            "name": "current_baseline",
            "memory_size": 50000,
            "min_memory": 1000,
            "description": "Your current configuration (baseline)"
        },
        {
            "name": "large",
            "memory_size": 100000,
            "min_memory": 2000,
            "description": "Large memory - more diversity but slower"
        },
        {
            "name": "high_min_threshold",
            "memory_size": 50000,
            "min_memory": 5000,
            "description": "Standard memory with higher min threshold"
        }
    ]
    
    results = {}
    n_actions = 21  # Your optimized action space
    
    print("REPLAY MEMORY EXPLORATION")
    print("21 Actions | 600 Episodes | Plateau Restart Strategy")
    print("=" * 80)
    print()
    
    for i, config in enumerate(memory_configs, 1):
        experiment_prefix = f"21act_replay_{config['name']}"
        
        print(f"EXPERIMENT {i}/{len(memory_configs)}: {config['name'].upper()}")
        print(f"Description: {config['description']}")
        print("-" * 70)
        
        results[config['name']] = train_replay_memory_experiment(
            n_actions=n_actions,
            replay_config=config,
            experiment_prefix=experiment_prefix
        )
        
        print(f"\n{config['name'].upper()} EXPERIMENT COMPLETED")
        print("=" * 70)
        print()
    
    # Analysis and comparison
    create_replay_memory_analysis(results, memory_configs)
    
    return results
In [11]:
def create_replay_memory_analysis(results, memory_configs):
    """Create comprehensive replay memory analysis"""
    
    print("=" * 80)
    print("REPLAY MEMORY COMPARISON RESULTS")
    print("=" * 80)
    print()
    
    # Comparison table
    print(f"{'Config':<20} {'Memory Size':<12} {'Min Memory':<11} {'Training Best':<13} {'Eval Mean':<11} {'Training Time':<13}")
    print("-" * 90)
    
    baseline_performance = None
    
    for config_name, result in results.items():
        eval_results = result['eval_results']
        eval_mean = eval_results['overall_mean'] if eval_results else "N/A"
        
        if config_name == "current_baseline":
            baseline_performance = eval_mean
        
        print(f"{config_name.title():<20} {result['memory_size']:<12,} {result['min_memory']:<11,} "
              f"{result['best_training_reward']:<13.1f} {eval_mean:<11.1f} "
              f"{result['training_time']/60:<13.1f}min")
    
    print()
    
    # Performance analysis
    if baseline_performance is not None:
        print("PERFORMANCE ANALYSIS vs BASELINE:")
        print("-" * 50)
        
        for config_name, result in results.items():
            if config_name != "current_baseline":
                eval_results = result['eval_results']
                if eval_results:
                    improvement = eval_results['overall_mean'] - baseline_performance
                    efficiency = result['training_time'] / results['current_baseline']['training_time']
                    
                    print(f"{config_name.title():<20}: {improvement:+6.1f} reward ({efficiency:.2f}x time)")
        
        print()
    
    # Statistical significance analysis
    print("STATISTICAL SIGNIFICANCE ANALYSIS:")
    print("-" * 50)
    
    baseline_config = "current_baseline"
    if baseline_config in results and results[baseline_config]['eval_results']:
        baseline_eval = results[baseline_config]['eval_results']
        
        for config_name, result in results.items():
            if config_name != baseline_config and result['eval_results']:
                eval_results = result['eval_results']
                
                # Check confidence interval overlap
                baseline_ci = [baseline_eval['ci_lower'], baseline_eval['ci_upper']]
                config_ci = [eval_results['ci_lower'], eval_results['ci_upper']]
                
                overlap = not (baseline_ci[1] < config_ci[0] or config_ci[1] < baseline_ci[0])
                significance = "NOT significant" if overlap else "SIGNIFICANT"
                
                improvement = eval_results['overall_mean'] - baseline_eval['overall_mean']
                print(f"{config_name.title():<20}: {improvement:+6.1f} ({significance})")
    
    print()
    
    # Create visualizations
    create_replay_memory_plots(results, memory_configs)
    
    # Save results
    save_replay_memory_results(results)
In [12]:
def create_replay_memory_plots(results, memory_configs):
    """Create comprehensive visualization of replay memory results"""
    
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 16))
    fig.suptitle('Replay Memory Configuration Analysis (21 Actions, 600 Episodes)', fontsize=20)
    
    # Extract data for plotting
    config_names = []
    memory_sizes = []
    min_memories = []
    eval_means = []
    eval_stds = []
    training_times = []
    training_best = []
    
    colors = ['red', 'blue', 'green', 'orange', 'purple']
    
    for i, (config_name, result) in enumerate(results.items()):
        config_names.append(config_name.replace('_', '\n').title())
        memory_sizes.append(result['memory_size'])
        min_memories.append(result['min_memory'])
        training_best.append(result['best_training_reward'])
        training_times.append(result['training_time'] / 60)  # Convert to minutes
        
        if result['eval_results']:
            eval_means.append(result['eval_results']['overall_mean'])
            eval_stds.append(result['eval_results']['overall_std'])
        else:
            eval_means.append(0)
            eval_stds.append(0)
    
    # Plot 1: Evaluation Performance
    bars1 = ax1.bar(config_names, eval_means, yerr=eval_stds, 
                   color=colors[:len(config_names)], alpha=0.7, capsize=5)
    ax1.set_title('Evaluation Performance by Memory Configuration', fontsize=14)
    ax1.set_ylabel('Mean Reward ± Std Dev')
    ax1.grid(True, alpha=0.3)
    ax1.tick_params(axis='x', rotation=45)
    
    # Highlight baseline
    baseline_idx = next((i for i, name in enumerate(config_names) if 'baseline' in name.lower()), 0)
    if baseline_idx < len(bars1):
        bars1[baseline_idx].set_edgecolor('black')
        bars1[baseline_idx].set_linewidth(3)
    
    for bar, mean, std in zip(bars1, eval_means, eval_stds):
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height + std + 5,
                f'{mean:.1f}±{std:.1f}', ha='center', va='bottom', fontsize=10)
    
    # Plot 2: Memory Size vs Performance
    ax2.scatter(memory_sizes, eval_means, c=colors[:len(config_names)], 
               s=200, alpha=0.7, edgecolors='black', linewidth=2)
    
    for i, name in enumerate(config_names):
        ax2.annotate(name, (memory_sizes[i], eval_means[i]), 
                    xytext=(5, 5), textcoords='offset points', fontsize=9)
    
    ax2.set_xlabel('Replay Memory Size')
    ax2.set_ylabel('Evaluation Performance')
    ax2.set_title('Performance vs Memory Size', fontsize=14)
    ax2.grid(True, alpha=0.3)
    
    # Plot 3: Training Time Analysis
    bars3 = ax3.bar(config_names, training_times, 
                   color=colors[:len(config_names)], alpha=0.7)
    ax3.set_title('Training Time by Memory Configuration', fontsize=14)
    ax3.set_ylabel('Training Time (minutes)')
    ax3.grid(True, alpha=0.3)
    ax3.tick_params(axis='x', rotation=45)
    
    for bar, time in zip(bars3, training_times):
        height = bar.get_height()
        ax3.text(bar.get_x() + bar.get_width()/2., height + 1,
                f'{time:.1f}m', ha='center', va='bottom', fontsize=10)
    
    # Plot 4: Efficiency Analysis (Performance per minute)
    efficiency = [perf / time if time > 0 else 0 for perf, time in zip(eval_means, training_times)]
    bars4 = ax4.bar(config_names, efficiency, 
                   color=colors[:len(config_names)], alpha=0.7)
    ax4.set_title('Training Efficiency (Performance per Minute)', fontsize=14)
    ax4.set_ylabel('Reward per Training Minute')
    ax4.grid(True, alpha=0.3)
    ax4.tick_params(axis='x', rotation=45)
    
    for bar, eff in zip(bars4, efficiency):
        height = bar.get_height()
        ax4.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                f'{eff:.2f}', ha='center', va='bottom', fontsize=10)
    
    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.savefig("replay_memory_comprehensive_analysis.png", dpi=300, bbox_inches='tight')
    plt.show()
In [13]:
def save_replay_memory_results(results):
    """Save replay memory results to JSON"""
    
    # Convert numpy types for JSON serialization
    json_results = {}
    for config_name, result in results.items():
        json_result = result.copy()
        
        # Convert eval_results
        if json_result['eval_results']:
            eval_results = json_result['eval_results'].copy()
            for key, value in eval_results.items():
                if isinstance(value, np.ndarray):
                    eval_results[key] = value.tolist()
                elif isinstance(value, (np.float64, np.float32)):
                    eval_results[key] = float(value)
                elif isinstance(value, (np.int64, np.int32)):
                    eval_results[key] = int(value)
            json_result['eval_results'] = eval_results
        
        # Convert other numpy arrays
        if 'memory_usage_history' in json_result:
            json_result['memory_usage_history'] = [int(x) for x in json_result['memory_usage_history']]
        if 'scores_history' in json_result:
            json_result['scores_history'] = [float(x) for x in json_result['scores_history']]
            
        json_results[config_name] = json_result
    
    with open("replay_memory_exploration_results.json", "w") as f:
        json.dump(json_results, f, indent=2)
    
    print("Results saved to 'replay_memory_exploration_results.json'")
In [36]:
if __name__ == "__main__":
    # Run replay memory exploration
    replay_results = run_replay_memory_exploration()
REPLAY MEMORY EXPLORATION
21 Actions | 600 Episodes | Plateau Restart Strategy
================================================================================

EXPERIMENT 1/5: SMALL
Description: Small memory - fast but limited diversity
----------------------------------------------------------------------
======================================================================
Replay Memory Experiment: SMALL
Memory Size: 10,000 | Min Memory: 500
======================================================================

Model Summary:

Model Summary:
Model: "dqn_32"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_96 (Dense)            multiple                  256       
                                                                 
 dense_97 (Dense)            multiple                  4160      
                                                                 
 dense_98 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -1646.50 | Avg(10): -1646.50 | ε: 0.995 | Memory: 200 (2.0%) | Steps: 0 | Time: 0.01s
Episode 2 | Reward: -1270.85 | Avg(10): -1458.68 | ε: 0.990 | Memory: 400 (4.0%) | Steps: 0 | Time: 0.02s
Episode 3 | Reward: -1281.28 | Avg(10): -1399.55 | ε: 0.985 | Memory: 600 (6.0%) | Steps: 101 | Time: 2.76s
Episode 4 | Reward: -1371.50 | Avg(10): -1392.53 | ε: 0.980 | Memory: 800 (8.0%) | Steps: 200 | Time: 4.89s
Episode 5 | Reward: -986.42 | Avg(10): -1311.31 | ε: 0.975 | Memory: 1,000 (10.0%) | Steps: 200 | Time: 4.98s
Episode 6 | Reward: -1621.32 | Avg(10): -1362.98 | ε: 0.970 | Memory: 1,200 (12.0%) | Steps: 200 | Time: 5.07s
Episode 7 | Reward: -1402.97 | Avg(10): -1368.69 | ε: 0.966 | Memory: 1,400 (14.0%) | Steps: 200 | Time: 5.06s
Episode 8 | Reward: -749.44 | Avg(10): -1291.29 | ε: 0.961 | Memory: 1,600 (16.0%) | Steps: 200 | Time: 5.33s
Episode 9 | Reward: -1388.14 | Avg(10): -1302.05 | ε: 0.956 | Memory: 1,800 (18.0%) | Steps: 200 | Time: 5.88s
Episode 10 | Reward: -1173.01 | Avg(10): -1289.14 | ε: 0.951 | Memory: 2,000 (20.0%) | Steps: 200 | Time: 5.70s
Episode 25 | Reward: -1786.59 | Avg(10): -1386.06 | ε: 0.882 | Memory: 5,000 (50.0%) | Steps: 200 | Time: 5.82s
Epsilon restart at episode 41: 0.818 → 0.300
Episode 50 | Reward: -1220.10 | Avg(10): -1066.98 | ε: 0.287 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 6.76s
Episode 75 | Reward: -1339.37 | Avg(10): -1123.62 | ε: 0.253 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 5.73s
Episode 100 | Reward: -1048.61 | Avg(10): -1102.33 | ε: 0.223 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 5.10s
Episode 125 | Reward: -243.08 | Avg(10): -323.31 | ε: 0.197 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 6.14s
Episode 150 | Reward: -132.49 | Avg(10): -521.71 | ε: 0.174 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 5.25s
Episode 175 | Reward: -125.34 | Avg(10): -260.84 | ε: 0.153 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 5.50s
Episode 200 | Reward: -117.57 | Avg(10): -196.38 | ε: 0.135 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 6.34s
Episode 225 | Reward: -244.77 | Avg(10): -313.64 | ε: 0.119 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 6.41s
Episode 250 | Reward: -122.16 | Avg(10): -260.55 | ε: 0.105 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 6.16s
Episode 275 | Reward: -124.96 | Avg(10): -184.26 | ε: 0.093 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 7.74s
Episode 300 | Reward: -1.49 | Avg(10): -192.03 | ε: 0.082 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 6.33s
Epsilon restart at episode 322: 0.074 → 0.300
Episode 325 | Reward: -369.37 | Avg(10): -281.27 | ε: 0.296 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 5.59s
Epsilon restart at episode 342: 0.273 → 0.300
Episode 350 | Reward: -485.39 | Avg(10): -317.55 | ε: 0.288 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 5.69s
Episode 375 | Reward: -243.84 | Avg(10): -245.24 | ε: 0.254 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 6.44s
Episode 400 | Reward: -249.51 | Avg(10): -161.39 | ε: 0.224 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 6.10s
Episode 425 | Reward: -125.76 | Avg(10): -184.70 | ε: 0.198 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 5.88s
Episode 450 | Reward: -119.62 | Avg(10): -181.31 | ε: 0.175 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 6.34s
Episode 475 | Reward: -125.32 | Avg(10): -240.18 | ε: 0.154 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 6.15s
Epsilon restart at episode 481: 0.150 → 0.300
Episode 500 | Reward: -126.74 | Avg(10): -328.07 | ε: 0.273 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 5.84s
Episode 525 | Reward: -252.20 | Avg(10): -212.02 | ε: 0.241 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 5.29s
Episode 550 | Reward: -229.67 | Avg(10): -268.95 | ε: 0.212 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 6.26s
Episode 575 | Reward: -352.62 | Avg(10): -228.02 | ε: 0.187 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 5.75s
Episode 600 | Reward: -120.91 | Avg(10): -168.16 | ε: 0.165 | Memory: 10,000 (100.0%) | Steps: 200 | Time: 6.46s

TRAINING COMPLETED
Episodes trained: 600
Best episode: 418
Best average reward over 10 episodes: -136.63
Final epsilon: 0.1652
Total training steps: 119,501
Final memory size: 10,000/10,000
Best model weights saved to: 21act_replay_small_weights.h5
Total training time: 3709.05s
Time per episode: 6.18s

Evaluating trained model...

Robust Evaluation: 21act_replay_small with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -183.5 ± 118.4
--- Run 2/5 ---
Run 2: -175.9 ± 136.4
--- Run 3/5 ---
Run 3: -169.2 ± 92.8
--- Run 4/5 ---
Run 4: -160.1 ± 110.4
--- Run 5/5 ---
Run 5: -175.7 ± 87.4

--- ROBUST EVALUATION SUMMARY ---
Overall mean: -172.89
Run-to-run std: 7.84
95% CI: [-182.64, -163.15]
--------------------------------------------------

SMALL EXPERIMENT COMPLETED
======================================================================

EXPERIMENT 2/5: MEDIUM_LOW_MIN
Description: Medium memory with standard min threshold
----------------------------------------------------------------------
======================================================================
Replay Memory Experiment: MEDIUM_LOW_MIN
Memory Size: 25,000 | Min Memory: 1,000
======================================================================

Model Summary:

Model Summary:
Model: "dqn_36"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_108 (Dense)           multiple                  256       
                                                                 
 dense_109 (Dense)           multiple                  4160      
                                                                 
 dense_110 (Dense)           multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -1546.54 | Avg(10): -1546.54 | ε: 0.995 | Memory: 200 (0.8%) | Steps: 0 | Time: 0.01s
Episode 2 | Reward: -1079.06 | Avg(10): -1312.80 | ε: 0.990 | Memory: 400 (1.6%) | Steps: 0 | Time: 0.01s
Episode 3 | Reward: -966.04 | Avg(10): -1197.21 | ε: 0.985 | Memory: 600 (2.4%) | Steps: 0 | Time: 0.01s
Episode 4 | Reward: -1544.03 | Avg(10): -1283.92 | ε: 0.980 | Memory: 800 (3.2%) | Steps: 0 | Time: 0.03s
Episode 5 | Reward: -877.45 | Avg(10): -1202.62 | ε: 0.975 | Memory: 1,000 (4.0%) | Steps: 1 | Time: 0.06s
Episode 6 | Reward: -1656.30 | Avg(10): -1278.23 | ε: 0.970 | Memory: 1,200 (4.8%) | Steps: 200 | Time: 5.26s
Episode 7 | Reward: -1574.03 | Avg(10): -1320.49 | ε: 0.966 | Memory: 1,400 (5.6%) | Steps: 200 | Time: 5.31s
Episode 8 | Reward: -1536.06 | Avg(10): -1347.44 | ε: 0.961 | Memory: 1,600 (6.4%) | Steps: 200 | Time: 7.31s
Episode 9 | Reward: -1294.53 | Avg(10): -1341.56 | ε: 0.956 | Memory: 1,800 (7.2%) | Steps: 200 | Time: 6.06s
Episode 10 | Reward: -1476.57 | Avg(10): -1355.06 | ε: 0.951 | Memory: 2,000 (8.0%) | Steps: 200 | Time: 5.65s
Epsilon restart at episode 20: 0.909 → 0.300
Episode 25 | Reward: -1481.45 | Avg(10): -1457.14 | ε: 0.293 | Memory: 5,000 (20.0%) | Steps: 200 | Time: 5.49s
Episode 50 | Reward: -1394.14 | Avg(10): -932.08 | ε: 0.258 | Memory: 10,000 (40.0%) | Steps: 200 | Time: 5.32s
Episode 75 | Reward: -977.06 | Avg(10): -1129.35 | ε: 0.228 | Memory: 15,000 (60.0%) | Steps: 200 | Time: 5.33s
Epsilon restart at episode 93: 0.209 → 0.300
Episode 100 | Reward: -915.65 | Avg(10): -1067.46 | ε: 0.290 | Memory: 20,000 (80.0%) | Steps: 200 | Time: 5.44s
Episode 125 | Reward: -243.34 | Avg(10): -361.54 | ε: 0.256 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.48s
Episode 150 | Reward: -124.60 | Avg(10): -238.14 | ε: 0.225 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 7.50s
Episode 175 | Reward: -392.60 | Avg(10): -351.75 | ε: 0.199 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 6.03s
Episode 200 | Reward: -237.44 | Avg(10): -283.53 | ε: 0.175 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.86s
Episode 225 | Reward: -118.16 | Avg(10): -185.08 | ε: 0.155 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 6.00s
Episode 250 | Reward: -125.29 | Avg(10): -113.27 | ε: 0.137 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.77s
Episode 275 | Reward: -121.81 | Avg(10): -173.00 | ε: 0.120 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.35s
Episode 300 | Reward: -253.56 | Avg(10): -165.28 | ε: 0.106 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.40s
Episode 325 | Reward: -1.82 | Avg(10): -186.28 | ε: 0.094 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.28s
Episode 350 | Reward: -126.79 | Avg(10): -181.19 | ε: 0.083 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 6.21s
Episode 375 | Reward: -125.32 | Avg(10): -203.42 | ε: 0.073 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 6.55s
Episode 400 | Reward: -124.12 | Avg(10): -120.29 | ε: 0.064 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 6.11s
Episode 425 | Reward: -236.90 | Avg(10): -172.15 | ε: 0.057 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 6.18s
Epsilon restart at episode 428: 0.056 → 0.300
Epsilon restart at episode 448: 0.273 → 0.300
Episode 450 | Reward: -122.09 | Avg(10): -191.94 | ε: 0.297 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.69s
Episode 475 | Reward: -248.82 | Avg(10): -184.48 | ε: 0.262 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.63s
Episode 500 | Reward: -123.35 | Avg(10): -217.02 | ε: 0.231 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.89s
Episode 525 | Reward: -235.94 | Avg(10): -180.95 | ε: 0.204 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.58s
Episode 550 | Reward: -252.10 | Avg(10): -171.71 | ε: 0.180 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.73s
Episode 575 | Reward: -126.59 | Avg(10): -189.50 | ε: 0.159 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.49s
Episode 600 | Reward: -2.54 | Avg(10): -122.26 | ε: 0.140 | Memory: 25,000 (100.0%) | Steps: 200 | Time: 5.76s

TRAINING COMPLETED
Episodes trained: 600
Best episode: 396
Best average reward over 10 episodes: -109.29
Final epsilon: 0.1400
Total training steps: 119,001
Final memory size: 25,000/25,000
Best model weights saved to: 21act_replay_medium_low_min_weights.h5
Total training time: 3520.94s
Time per episode: 5.87s

Evaluating trained model...

Robust Evaluation: 21act_replay_medium_low_min with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -137.7 ± 54.1
--- Run 2/5 ---
Run 2: -158.2 ± 94.3
--- Run 3/5 ---
Run 3: -252.6 ± 326.8
--- Run 4/5 ---
Run 4: -161.9 ± 95.4
--- Run 5/5 ---
Run 5: -121.0 ± 88.8

--- ROBUST EVALUATION SUMMARY ---
Overall mean: -166.27
Run-to-run std: 45.60
95% CI: [-222.89, -109.66]
--------------------------------------------------

MEDIUM_LOW_MIN EXPERIMENT COMPLETED
======================================================================

EXPERIMENT 3/5: CURRENT_BASELINE
Description: Your current configuration (baseline)
----------------------------------------------------------------------
======================================================================
Replay Memory Experiment: CURRENT_BASELINE
Memory Size: 50,000 | Min Memory: 1,000
======================================================================

Model Summary:

Model Summary:
Model: "dqn_40"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_120 (Dense)           multiple                  256       
                                                                 
 dense_121 (Dense)           multiple                  4160      
                                                                 
 dense_122 (Dense)           multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -982.02 | Avg(10): -982.02 | ε: 0.995 | Memory: 200 (0.4%) | Steps: 0 | Time: 0.01s
Episode 2 | Reward: -903.54 | Avg(10): -942.78 | ε: 0.990 | Memory: 400 (0.8%) | Steps: 0 | Time: 0.02s
Episode 3 | Reward: -1499.85 | Avg(10): -1128.47 | ε: 0.985 | Memory: 600 (1.2%) | Steps: 0 | Time: 0.01s
Episode 4 | Reward: -1358.03 | Avg(10): -1185.86 | ε: 0.980 | Memory: 800 (1.6%) | Steps: 0 | Time: 0.02s
Episode 5 | Reward: -1641.46 | Avg(10): -1276.98 | ε: 0.975 | Memory: 1,000 (2.0%) | Steps: 1 | Time: 0.06s
Episode 6 | Reward: -994.71 | Avg(10): -1229.93 | ε: 0.970 | Memory: 1,200 (2.4%) | Steps: 200 | Time: 5.22s
Episode 7 | Reward: -1517.45 | Avg(10): -1271.01 | ε: 0.966 | Memory: 1,400 (2.8%) | Steps: 200 | Time: 5.27s
Episode 8 | Reward: -1007.18 | Avg(10): -1238.03 | ε: 0.961 | Memory: 1,600 (3.2%) | Steps: 200 | Time: 5.56s
Episode 9 | Reward: -1263.69 | Avg(10): -1240.88 | ε: 0.956 | Memory: 1,800 (3.6%) | Steps: 200 | Time: 5.30s
Episode 10 | Reward: -1579.96 | Avg(10): -1274.79 | ε: 0.951 | Memory: 2,000 (4.0%) | Steps: 200 | Time: 5.35s
Epsilon restart at episode 20: 0.909 → 0.300
Episode 25 | Reward: -1771.91 | Avg(10): -1464.39 | ε: 0.293 | Memory: 5,000 (10.0%) | Steps: 200 | Time: 6.03s
Episode 50 | Reward: -936.53 | Avg(10): -1143.67 | ε: 0.258 | Memory: 10,000 (20.0%) | Steps: 200 | Time: 5.93s
Episode 75 | Reward: -1039.63 | Avg(10): -1090.97 | ε: 0.228 | Memory: 15,000 (30.0%) | Steps: 200 | Time: 6.00s
Epsilon restart at episode 89: 0.213 → 0.300
Episode 100 | Reward: -985.26 | Avg(10): -963.21 | ε: 0.284 | Memory: 20,000 (40.0%) | Steps: 200 | Time: 5.54s
Episode 125 | Reward: -475.97 | Avg(10): -362.67 | ε: 0.250 | Memory: 25,000 (50.0%) | Steps: 200 | Time: 5.89s
Episode 150 | Reward: -1385.00 | Avg(10): -580.72 | ε: 0.221 | Memory: 30,000 (60.0%) | Steps: 200 | Time: 5.82s
Epsilon restart at episode 159: 0.212 → 0.300
Episode 175 | Reward: -124.73 | Avg(10): -294.80 | ε: 0.277 | Memory: 35,000 (70.0%) | Steps: 200 | Time: 6.69s
Episode 200 | Reward: -454.39 | Avg(10): -222.98 | ε: 0.244 | Memory: 40,000 (80.0%) | Steps: 200 | Time: 5.85s
Episode 225 | Reward: -125.08 | Avg(10): -222.88 | ε: 0.215 | Memory: 45,000 (90.0%) | Steps: 200 | Time: 6.00s
Epsilon restart at episode 227: 0.214 → 0.300
Episode 250 | Reward: -2.22 | Avg(10): -296.65 | ε: 0.267 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 6.18s
Episode 275 | Reward: -1.77 | Avg(10): -148.60 | ε: 0.236 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.87s
Episode 300 | Reward: -364.96 | Avg(10): -221.71 | ε: 0.208 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 6.18s
Epsilon restart at episode 310: 0.199 → 0.300
Episode 325 | Reward: -126.43 | Avg(10): -211.45 | ε: 0.278 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 6.59s
Episode 350 | Reward: -238.22 | Avg(10): -244.16 | ε: 0.245 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 6.05s
Epsilon restart at episode 359: 0.236 → 0.300
Episode 375 | Reward: -632.61 | Avg(10): -242.35 | ε: 0.277 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.83s
Episode 400 | Reward: -126.68 | Avg(10): -162.89 | ε: 0.244 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.85s
Episode 425 | Reward: -248.64 | Avg(10): -235.80 | ε: 0.215 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.83s
Epsilon restart at episode 431: 0.210 → 0.300
Episode 450 | Reward: -472.42 | Avg(10): -208.25 | ε: 0.273 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 6.49s
Episode 475 | Reward: -380.72 | Avg(10): -208.20 | ε: 0.241 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.71s
Epsilon restart at episode 476: 0.241 → 0.300
Episode 500 | Reward: -251.87 | Avg(10): -187.37 | ε: 0.266 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.32s
Epsilon restart at episode 524: 0.237 → 0.300
Episode 525 | Reward: -243.81 | Avg(10): -257.06 | ε: 0.298 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.95s
Episode 550 | Reward: -125.98 | Avg(10): -287.31 | ε: 0.263 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.62s
Episode 575 | Reward: -126.18 | Avg(10): -244.50 | ε: 0.232 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.98s
Episode 600 | Reward: -122.87 | Avg(10): -275.29 | ε: 0.205 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.85s

TRAINING COMPLETED
Episodes trained: 600
Best episode: 281
Best average reward over 10 episodes: -135.26
Final epsilon: 0.2050
Total training steps: 119,001
Final memory size: 50,000/50,000
Best model weights saved to: 21act_replay_current_baseline_weights.h5
Total training time: 3608.49s
Time per episode: 6.01s

Evaluating trained model...

Robust Evaluation: 21act_replay_current_baseline with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -117.5 ± 96.0
--- Run 2/5 ---
Run 2: -196.5 ± 118.3
--- Run 3/5 ---
Run 3: -176.4 ± 138.4
--- Run 4/5 ---
Run 4: -192.0 ± 82.0
--- Run 5/5 ---
Run 5: -198.6 ± 323.0

--- ROBUST EVALUATION SUMMARY ---
Overall mean: -176.19
Run-to-run std: 30.38
95% CI: [-213.90, -138.47]
--------------------------------------------------

CURRENT_BASELINE EXPERIMENT COMPLETED
======================================================================

EXPERIMENT 4/5: LARGE
Description: Large memory - more diversity but slower
----------------------------------------------------------------------
======================================================================
Replay Memory Experiment: LARGE
Memory Size: 100,000 | Min Memory: 2,000
======================================================================

Model Summary:

Model Summary:
Model: "dqn_44"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_132 (Dense)           multiple                  256       
                                                                 
 dense_133 (Dense)           multiple                  4160      
                                                                 
 dense_134 (Dense)           multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -1041.62 | Avg(10): -1041.62 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.01s
Episode 2 | Reward: -742.46 | Avg(10): -892.04 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.04s
Episode 3 | Reward: -1355.22 | Avg(10): -1046.43 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.02s
Episode 4 | Reward: -969.37 | Avg(10): -1027.17 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.02s
Episode 5 | Reward: -1383.99 | Avg(10): -1098.53 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.03s
Episode 6 | Reward: -1312.42 | Avg(10): -1134.18 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.02s
Episode 7 | Reward: -939.68 | Avg(10): -1106.39 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.03s
Episode 8 | Reward: -874.00 | Avg(10): -1077.35 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.03s
Episode 9 | Reward: -1509.02 | Avg(10): -1125.31 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.05s
Episode 10 | Reward: -1308.58 | Avg(10): -1143.64 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.08s
Epsilon restart at episode 20: 0.909 → 0.300
Episode 25 | Reward: -1170.61 | Avg(10): -1399.62 | ε: 0.293 | Memory: 5,000 (5.0%) | Steps: 200 | Time: 5.62s
Episode 50 | Reward: -514.80 | Avg(10): -1224.02 | ε: 0.258 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 5.43s
Episode 75 | Reward: -382.62 | Avg(10): -614.71 | ε: 0.228 | Memory: 15,000 (15.0%) | Steps: 200 | Time: 6.01s
Epsilon restart at episode 94: 0.208 → 0.300
Episode 100 | Reward: -1173.71 | Avg(10): -1135.99 | ε: 0.291 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 5.98s
Epsilon restart at episode 114: 0.273 → 0.300
Episode 125 | Reward: -508.52 | Avg(10): -817.79 | ε: 0.284 | Memory: 25,000 (25.0%) | Steps: 200 | Time: 6.40s
Episode 150 | Reward: -634.11 | Avg(10): -477.39 | ε: 0.250 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 6.36s
Episode 175 | Reward: -665.88 | Avg(10): -508.56 | ε: 0.221 | Memory: 35,000 (35.0%) | Steps: 200 | Time: 6.71s
Episode 200 | Reward: -356.91 | Avg(10): -289.11 | ε: 0.195 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 5.30s
Episode 225 | Reward: -379.19 | Avg(10): -169.98 | ε: 0.172 | Memory: 45,000 (45.0%) | Steps: 200 | Time: 6.58s
Episode 250 | Reward: -122.91 | Avg(10): -164.77 | ε: 0.152 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 6.57s
Episode 275 | Reward: -230.15 | Avg(10): -181.22 | ε: 0.134 | Memory: 55,000 (55.0%) | Steps: 200 | Time: 7.33s
Episode 300 | Reward: -119.78 | Avg(10): -210.50 | ε: 0.118 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 5.36s
Episode 325 | Reward: -328.25 | Avg(10): -205.06 | ε: 0.104 | Memory: 65,000 (65.0%) | Steps: 200 | Time: 5.79s
Episode 350 | Reward: -118.10 | Avg(10): -155.74 | ε: 0.092 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 6.45s
Episode 375 | Reward: -348.55 | Avg(10): -228.12 | ε: 0.081 | Memory: 75,000 (75.0%) | Steps: 200 | Time: 5.34s
Epsilon restart at episode 382: 0.079 → 0.300
Episode 400 | Reward: -2.39 | Avg(10): -275.49 | ε: 0.274 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 7.44s
Epsilon restart at episode 402: 0.273 → 0.300
Episode 425 | Reward: -1.47 | Avg(10): -234.72 | ε: 0.267 | Memory: 85,000 (85.0%) | Steps: 200 | Time: 5.42s
Episode 450 | Reward: -359.59 | Avg(10): -196.96 | ε: 0.236 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 5.32s
Episode 475 | Reward: -121.75 | Avg(10): -170.51 | ε: 0.208 | Memory: 95,000 (95.0%) | Steps: 200 | Time: 5.91s
Episode 500 | Reward: -123.93 | Avg(10): -193.14 | ε: 0.184 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.31s
Episode 525 | Reward: -347.22 | Avg(10): -205.30 | ε: 0.162 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 7.51s
Episode 550 | Reward: -121.47 | Avg(10): -135.90 | ε: 0.143 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.42s
Episode 575 | Reward: -1.70 | Avg(10): -146.38 | ε: 0.126 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.27s
Episode 600 | Reward: -312.27 | Avg(10): -142.28 | ε: 0.111 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.51s

TRAINING COMPLETED
Episodes trained: 600
Best episode: 584
Best average reward over 10 episodes: -96.05
Final epsilon: 0.1112
Total training steps: 118,001
Final memory size: 100,000/100,000
Best model weights saved to: 21act_replay_large_weights.h5
Total training time: 3640.93s
Time per episode: 6.07s

Evaluating trained model...

Robust Evaluation: 21act_replay_large with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -132.3 ± 74.4
--- Run 2/5 ---
Run 2: -146.0 ± 89.2
--- Run 3/5 ---
Run 3: -162.6 ± 93.3
--- Run 4/5 ---
Run 4: -147.2 ± 77.9
--- Run 5/5 ---
Run 5: -171.4 ± 91.7

--- ROBUST EVALUATION SUMMARY ---
Overall mean: -151.90
Run-to-run std: 13.69
95% CI: [-168.91, -134.90]
--------------------------------------------------

LARGE EXPERIMENT COMPLETED
======================================================================

EXPERIMENT 5/5: HIGH_MIN_THRESHOLD
Description: Standard memory with higher min threshold
----------------------------------------------------------------------
======================================================================
Replay Memory Experiment: HIGH_MIN_THRESHOLD
Memory Size: 50,000 | Min Memory: 5,000
======================================================================

Model Summary:

Model Summary:
Model: "dqn_48"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_144 (Dense)           multiple                  256       
                                                                 
 dense_145 (Dense)           multiple                  4160      
                                                                 
 dense_146 (Dense)           multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -1349.94 | Avg(10): -1349.94 | ε: 0.995 | Memory: 200 (0.4%) | Steps: 0 | Time: 0.01s
Episode 2 | Reward: -1013.39 | Avg(10): -1181.67 | ε: 0.990 | Memory: 400 (0.8%) | Steps: 0 | Time: 0.01s
Episode 3 | Reward: -1157.96 | Avg(10): -1173.76 | ε: 0.985 | Memory: 600 (1.2%) | Steps: 0 | Time: 0.03s
Episode 4 | Reward: -1196.14 | Avg(10): -1179.36 | ε: 0.980 | Memory: 800 (1.6%) | Steps: 0 | Time: 0.01s
Episode 5 | Reward: -1171.74 | Avg(10): -1177.84 | ε: 0.975 | Memory: 1,000 (2.0%) | Steps: 0 | Time: 0.04s
Episode 6 | Reward: -1566.97 | Avg(10): -1242.69 | ε: 0.970 | Memory: 1,200 (2.4%) | Steps: 0 | Time: 0.02s
Episode 7 | Reward: -875.74 | Avg(10): -1190.27 | ε: 0.966 | Memory: 1,400 (2.8%) | Steps: 0 | Time: 0.05s
Episode 8 | Reward: -898.99 | Avg(10): -1153.86 | ε: 0.961 | Memory: 1,600 (3.2%) | Steps: 0 | Time: 0.03s
Episode 9 | Reward: -1313.98 | Avg(10): -1171.65 | ε: 0.956 | Memory: 1,800 (3.6%) | Steps: 0 | Time: 0.05s
Episode 10 | Reward: -1597.24 | Avg(10): -1214.21 | ε: 0.951 | Memory: 2,000 (4.0%) | Steps: 0 | Time: 0.08s
Epsilon restart at episode 20: 0.909 → 0.300
Episode 25 | Reward: -1813.56 | Avg(10): -1304.03 | ε: 0.293 | Memory: 5,000 (10.0%) | Steps: 1 | Time: 0.43s
Episode 50 | Reward: -1311.87 | Avg(10): -1467.74 | ε: 0.258 | Memory: 10,000 (20.0%) | Steps: 200 | Time: 5.26s
Episode 75 | Reward: -1216.21 | Avg(10): -1086.89 | ε: 0.228 | Memory: 15,000 (30.0%) | Steps: 200 | Time: 5.49s
Episode 100 | Reward: -896.87 | Avg(10): -901.97 | ε: 0.201 | Memory: 20,000 (40.0%) | Steps: 200 | Time: 5.37s
Episode 125 | Reward: -249.32 | Avg(10): -443.04 | ε: 0.177 | Memory: 25,000 (50.0%) | Steps: 200 | Time: 6.20s
Episode 150 | Reward: -519.18 | Avg(10): -235.47 | ε: 0.156 | Memory: 30,000 (60.0%) | Steps: 200 | Time: 5.97s
Episode 175 | Reward: -381.12 | Avg(10): -165.30 | ε: 0.138 | Memory: 35,000 (70.0%) | Steps: 200 | Time: 7.71s
Episode 200 | Reward: -120.60 | Avg(10): -150.02 | ε: 0.122 | Memory: 40,000 (80.0%) | Steps: 200 | Time: 6.18s
Episode 225 | Reward: -1.98 | Avg(10): -181.46 | ε: 0.107 | Memory: 45,000 (90.0%) | Steps: 200 | Time: 6.03s
Epsilon restart at episode 226: 0.107 → 0.300
Epsilon restart at episode 246: 0.273 → 0.300
Episode 250 | Reward: -240.39 | Avg(10): -220.68 | ε: 0.294 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 9.45s
Epsilon restart at episode 266: 0.273 → 0.300
Episode 275 | Reward: -248.46 | Avg(10): -285.30 | ε: 0.287 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 7.23s
Episode 300 | Reward: -121.61 | Avg(10): -254.24 | ε: 0.253 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 6.03s
Episode 325 | Reward: -232.80 | Avg(10): -171.31 | ε: 0.223 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 6.20s
Episode 350 | Reward: -124.18 | Avg(10): -182.65 | ε: 0.197 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 10.07s
Epsilon restart at episode 359: 0.189 → 0.300
Episode 375 | Reward: -120.89 | Avg(10): -191.50 | ε: 0.277 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 7.51s
Epsilon restart at episode 390: 0.258 → 0.300
Episode 400 | Reward: -356.86 | Avg(10): -183.63 | ε: 0.285 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 6.05s
Episode 425 | Reward: -442.05 | Avg(10): -253.08 | ε: 0.252 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.97s
Epsilon restart at episode 429: 0.248 → 0.300
Episode 450 | Reward: -121.55 | Avg(10): -205.63 | ε: 0.270 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 7.35s
Episode 475 | Reward: -357.22 | Avg(10): -208.64 | ε: 0.238 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 6.48s
Episode 500 | Reward: -235.65 | Avg(10): -189.45 | ε: 0.210 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.93s
Episode 525 | Reward: -245.67 | Avg(10): -198.88 | ε: 0.185 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 8.75s
Epsilon restart at episode 536: 0.176 → 0.300
Episode 550 | Reward: -491.72 | Avg(10): -343.63 | ε: 0.280 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 6.71s
Epsilon restart at episode 556: 0.273 → 0.300
Episode 575 | Reward: -237.85 | Avg(10): -293.62 | ε: 0.273 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 6.13s
Episode 600 | Reward: -366.19 | Avg(10): -283.43 | ε: 0.241 | Memory: 50,000 (100.0%) | Steps: 200 | Time: 5.83s

TRAINING COMPLETED
Episodes trained: 600
Best episode: 361
Best average reward over 10 episodes: -133.75
Final epsilon: 0.2406
Total training steps: 115,001
Final memory size: 50,000/50,000
Best model weights saved to: 21act_replay_high_min_threshold_weights.h5
Total training time: 3901.75s
Time per episode: 6.50s

Evaluating trained model...

Robust Evaluation: 21act_replay_high_min_threshold with epsilon=0.0
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -141.7 ± 80.9
--- Run 2/5 ---
Run 2: -167.3 ± 102.5
--- Run 3/5 ---
Run 3: -175.6 ± 72.7
--- Run 4/5 ---
Run 4: -181.2 ± 96.5
--- Run 5/5 ---
Run 5: -148.6 ± 115.2

--- ROBUST EVALUATION SUMMARY ---
Overall mean: -162.85
Run-to-run std: 15.29
95% CI: [-181.84, -143.86]
--------------------------------------------------

HIGH_MIN_THRESHOLD EXPERIMENT COMPLETED
======================================================================

================================================================================
REPLAY MEMORY COMPARISON RESULTS
================================================================================

Config               Memory Size  Min Memory  Training Best Eval Mean   Training Time
------------------------------------------------------------------------------------------
Small                10,000       500         -136.6        -172.9      61.8         min
Medium_Low_Min       25,000       1,000       -109.3        -166.3      58.7         min
Current_Baseline     50,000       1,000       -135.3        -176.2      60.1         min
Large                100,000      2,000       -96.1         -151.9      60.7         min
High_Min_Threshold   50,000       5,000       -133.8        -162.8      65.0         min

PERFORMANCE ANALYSIS vs BASELINE:
--------------------------------------------------
Small               :   +3.3 reward (1.03x time)
Medium_Low_Min      :   +9.9 reward (0.98x time)
Large               :  +24.3 reward (1.01x time)
High_Min_Threshold  :  +13.3 reward (1.08x time)

STATISTICAL SIGNIFICANCE ANALYSIS:
--------------------------------------------------
Small               :   +3.3 (NOT significant)
Medium_Low_Min      :   +9.9 (NOT significant)
Large               :  +24.3 (NOT significant)
High_Min_Threshold  :  +13.3 (NOT significant)

No description has been provided for this image
Results saved to 'replay_memory_exploration_results.json'

Spotted an issue regarding

  • In the evaluate_replay_memory_robust(), the replay memory size I set was fixed to 50000, which I should not be doing.
  • I should be evaluating using the same replay size that was used in training since replay memory is part of the learned agent and it can affect the results

I had to re-evaluate¶

In [7]:
def evaluate_replay_memory_corrected(experiment_prefix, n_actions, memory_config, num_episodes=20, num_runs=5):
    """Corrected evaluation with MATCHING memory configuration"""
    
    INPUT_SHAPE = 3
    GAMMA = 0.99
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    MAX_STEPS = 200
    
    # FIX: Use SAME memory configuration as training
    REPLAY_MEMORY_SIZE = memory_config["memory_size"]
    MIN_REPLAY_MEMORY = memory_config["min_memory"]
    
    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"
    
    print(f"\nCorrected Evaluation: {experiment_prefix}")
    print(f"Memory Size: {REPLAY_MEMORY_SIZE:,} | Min Memory: {MIN_REPLAY_MEMORY:,}")
    
    # Recreate agent with MATCHING memory configuration
    agent = DQNAgent(INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, 
                    MIN_REPLAY_MEMORY, BATCH_SIZE, TARGET_UPDATE_EVERY, 
                    LEARNING_RATE, EPSILON_START, EPSILON_MIN, EPSILON_DECAY)
    
    try:
        agent.load(SAVE_WEIGHTS_PATH)
        agent.epsilon = 0.0  # Force pure exploitation
        print(f" Loaded weights from {SAVE_WEIGHTS_PATH}")
    except FileNotFoundError:
        print(f" Warning: Weights file {SAVE_WEIGHTS_PATH} not found")
        return None
    
    print(f"Running {num_runs} evaluation sessions of {num_episodes} episodes each")
    
    all_run_results = []
    
    for run in range(num_runs):
        print(f"--- Run {run+1}/{num_runs} ---")
        env = gym.make('Pendulum-v0')
        
        run_rewards = []
        
        for ep in range(num_episodes):
            s = env.reset()
            
            # FIX: Ensure proper state shape
            if isinstance(s, tuple):
                s = s[0]  # Handle new gym API
            
            # Ensure state is proper numpy array with shape (3,)
            s = np.array(s, dtype=np.float32)
            if s.shape != (3,):
                s = s.flatten()[:3]  # Ensure exactly 3 elements
                
            total_reward = 0
            
            for t in range(MAX_STEPS):
                # FIX: Ensure state is properly shaped before action selection
                a_idx = agent.select_action(s)  # epsilon=0, purely greedy
                torque = action_index_to_torque(a_idx, n_actions)
                
                # Step environment
                s_next, r, done, info = env.step([torque])  # Pendulum expects array
                
                # FIX: Ensure proper state shape for next state
                if isinstance(s_next, tuple):
                    s_next = s_next[0]
                    
                s_next = np.array(s_next, dtype=np.float32)
                if s_next.shape != (3,):
                    s_next = s_next.flatten()[:3]
                
                total_reward += r
                s = s_next
                
                if done:
                    break
            
            run_rewards.append(total_reward)
        
        env.close()
        
        run_mean = np.mean(run_rewards)
        run_std = np.std(run_rewards)
        all_run_results.append({
            'mean': run_mean,
            'std': run_std,
            'rewards': run_rewards
        })
        
        print(f"Run {run+1}: {run_mean:.1f} ± {run_std:.1f}")
    
    # Overall statistics
    all_means = [run['mean'] for run in all_run_results]
    overall_mean = np.mean(all_means)
    overall_std = np.std(all_means)
    
    # All individual episode rewards
    all_rewards = []
    for run in all_run_results:
        all_rewards.extend(run['rewards'])
    
    # Confidence interval
    confidence_level = 0.95
    dof = len(all_means) - 1
    t_critical = stats.t.ppf((1 + confidence_level) / 2, dof)
    margin_of_error = t_critical * (overall_std / np.sqrt(len(all_means)))
    ci_lower = overall_mean - margin_of_error
    ci_upper = overall_mean + margin_of_error
    
    print(f"\n--- CORRECTED EVALUATION SUMMARY ---")
    print(f"Overall mean: {overall_mean:.2f}")
    print(f"Run-to-run std: {overall_std:.2f}")
    print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
    print("-" * 50)
    
    return {
        'overall_mean': overall_mean,
        'overall_std': overall_std,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'run_means': all_means,
        'all_rewards': all_rewards,
        'num_runs': num_runs,
        'num_episodes': num_episodes,
        'memory_size': REPLAY_MEMORY_SIZE,
        'min_memory': MIN_REPLAY_MEMORY
    }
In [22]:
def action_index_to_torque(action_index, n_actions):
    """Convert action index to torque value"""
    return -2.0 + (action_index * 4.0) / (n_actions - 1)

def run_corrected_replay_memory_evaluation():
    """Re-evaluate all replay memory configurations with corrected memory sizes"""
    
    # Set seeds for reproducibility
    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    tf.random.set_seed(SEED)
    
    # Same configurations as original experiment
    memory_configs = [
        {
            "name": "small",
            "memory_size": 10000,
            "min_memory": 500,
            "description": "Small memory - fast but limited diversity"
        },
        {
            "name": "medium_low_min",
            "memory_size": 25000,
            "min_memory": 1000,
            "description": "Medium memory with standard min threshold"
        },
        {
            "name": "current_baseline",
            "memory_size": 50000,
            "min_memory": 1000,
            "description": "Your current configuration (baseline)"
        },
        {
            "name": "large",
            "memory_size": 100000,
            "min_memory": 2000,
            "description": "Large memory - more diversity but slower"
        },
        {
            "name": "high_min_threshold",
            "memory_size": 50000,
            "min_memory": 5000,
            "description": "Standard memory with higher min threshold"
        }
    ]
    
    corrected_results = {}
    n_actions = 21
    
    print("CORRECTED REPLAY MEMORY RE-EVALUATION")
    print("Using MATCHING memory configurations for evaluation")
    print("=" * 80)
    print()
    
    for i, config in enumerate(memory_configs, 1):
        experiment_prefix = f"21act_replay_{config['name']}"
        
        print(f"CORRECTED EVALUATION {i}/{len(memory_configs)}: {config['name'].upper()}")
        print(f"Training Memory: {config['memory_size']:,} | Min: {config['min_memory']:,}")
        print("-" * 70)
        
        corrected_results[config['name']] = evaluate_replay_memory_corrected(
            experiment_prefix=experiment_prefix,
            n_actions=n_actions,
            memory_config=config,
            num_episodes=20,
            num_runs=5
        )
        
        print(f"\n{config['name'].upper()} CORRECTED EVALUATION COMPLETED")
        print("=" * 70)
        print()
    
    # Compare original vs corrected results
    compare_original_vs_corrected(corrected_results)
    
    return corrected_results
In [23]:
def compare_original_vs_corrected(corrected_results):
    """Compare original vs corrected evaluation results"""
    
    # Original results (from your experiment)
    original_results = {
        'small': -172.9,
        'medium_low_min': -166.3,
        'current_baseline': -176.2,
        'large': -151.9,
        'high_min_threshold': -162.8
    }
    
    print("=" * 80)
    print("ORIGINAL vs CORRECTED EVALUATION COMPARISON")
    print("=" * 80)
    print()
    
    print(f"{'Config':<20} {'Original':<12} {'Corrected':<12} {'Difference':<12} {'Impact':<15}")
    print("-" * 75)
    
    total_differences = []
    
    for config_name in corrected_results.keys():
        if config_name in original_results and corrected_results[config_name]:
            original = original_results[config_name]
            corrected = corrected_results[config_name]['overall_mean']
            difference = corrected - original
            total_differences.append(abs(difference))
            
            impact = "SIGNIFICANT" if abs(difference) > 10 else "MINOR"
            
            print(f"{config_name.title():<20} {original:<12.1f} {corrected:<12.1f} "
                  f"{difference:<+12.1f} {impact:<15}")
    
    print()
    
    # Statistical analysis
    if total_differences:
        avg_difference = np.mean(total_differences)
        max_difference = np.max(total_differences)
        
        print("IMPACT ANALYSIS:")
        print("-" * 40)
        print(f"Average absolute difference: {avg_difference:.1f} points")
        print(f"Maximum absolute difference: {max_difference:.1f} points")
        
        if max_difference < 5:
            print(" CONCLUSION: Memory size during evaluation has MINIMAL impact")
            print("   Original results and rankings remain valid")
        elif max_difference < 15:
            print("  CONCLUSION: Memory size has MODERATE impact")
            print("   Rankings might change slightly")
        else:
            print(" CONCLUSION: Memory size has SIGNIFICANT impact")
            print("   Original results should be discarded")
    
    # Updated ranking
    print()
    print("UPDATED PERFORMANCE RANKING:")
    print("-" * 40)
    
    valid_configs = {k: v['overall_mean'] for k, v in corrected_results.items() if v is not None}
    sorted_configs = sorted(valid_configs.items(), key=lambda x: x[1], reverse=True)
    
    for i, (config_name, performance) in enumerate(sorted_configs, 1):
        print(f"{i}. {config_name.title():<20}: {performance:.1f}")
    
    # Save corrected results
    save_corrected_results(corrected_results, original_results)
In [24]:
def save_corrected_results(corrected_results, original_results):
    """Save corrected evaluation results"""
    
    # Prepare data for JSON serialization
    json_data = {
        'original_results': original_results,
        'corrected_results': {},
        'comparison_summary': {}
    }
    
    for config_name, result in corrected_results.items():
        if result:
            # Convert numpy types
            corrected_result = {}
            for key, value in result.items():
                if isinstance(value, np.ndarray):
                    corrected_result[key] = value.tolist()
                elif isinstance(value, (np.float64, np.float32)):
                    corrected_result[key] = float(value)
                elif isinstance(value, (np.int64, np.int32)):
                    corrected_result[key] = int(value)
                else:
                    corrected_result[key] = value
            
            json_data['corrected_results'][config_name] = corrected_result
            
            # Add comparison
            if config_name in original_results:
                original = original_results[config_name]
                corrected = result['overall_mean']
                json_data['comparison_summary'][config_name] = {
                    'original': float(original),
                    'corrected': float(corrected),
                    'difference': float(corrected - original),
                    'impact': 'SIGNIFICANT' if abs(corrected - original) > 10 else 'MINOR'
                }
    
    with open("replay_memory_corrected_evaluation.json", "w") as f:
        json.dump(json_data, f, indent=2)
    
    print(f"\nCorrected results saved to 'replay_memory_corrected_evaluation.json'")
In [25]:
if __name__ == "__main__":
    # Run corrected re-evaluation
    corrected_results = run_corrected_replay_memory_evaluation()
CORRECTED REPLAY MEMORY RE-EVALUATION
Using MATCHING memory configurations for evaluation
================================================================================

CORRECTED EVALUATION 1/5: SMALL
Training Memory: 10,000 | Min: 500
----------------------------------------------------------------------

Corrected Evaluation: 21act_replay_small
Memory Size: 10,000 | Min Memory: 500
 Loaded weights from 21act_replay_small_weights.h5
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -188.5 ± 110.2
--- Run 2/5 ---
Run 2: -213.4 ± 138.4
--- Run 3/5 ---
Run 3: -143.6 ± 125.4
--- Run 4/5 ---
Run 4: -173.1 ± 115.7
--- Run 5/5 ---
Run 5: -214.3 ± 142.2

--- CORRECTED EVALUATION SUMMARY ---
Overall mean: -186.59
Run-to-run std: 26.51
95% CI: [-219.50, -153.67]
--------------------------------------------------

SMALL CORRECTED EVALUATION COMPLETED
======================================================================

CORRECTED EVALUATION 2/5: MEDIUM_LOW_MIN
Training Memory: 25,000 | Min: 1,000
----------------------------------------------------------------------

Corrected Evaluation: 21act_replay_medium_low_min
Memory Size: 25,000 | Min Memory: 1,000
 Loaded weights from 21act_replay_medium_low_min_weights.h5
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -185.9 ± 102.3
--- Run 2/5 ---
Run 2: -184.1 ± 117.9
--- Run 3/5 ---
Run 3: -183.4 ± 100.8
--- Run 4/5 ---
Run 4: -173.2 ± 81.6
--- Run 5/5 ---
Run 5: -127.0 ± 81.1

--- CORRECTED EVALUATION SUMMARY ---
Overall mean: -170.73
Run-to-run std: 22.29
95% CI: [-198.41, -143.04]
--------------------------------------------------

MEDIUM_LOW_MIN CORRECTED EVALUATION COMPLETED
======================================================================

CORRECTED EVALUATION 3/5: CURRENT_BASELINE
Training Memory: 50,000 | Min: 1,000
----------------------------------------------------------------------

Corrected Evaluation: 21act_replay_current_baseline
Memory Size: 50,000 | Min Memory: 1,000
 Loaded weights from 21act_replay_current_baseline_weights.h5
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -184.6 ± 93.5
--- Run 2/5 ---
Run 2: -167.8 ± 133.6
--- Run 3/5 ---
Run 3: -267.9 ± 314.6
--- Run 4/5 ---
Run 4: -164.9 ± 111.0
--- Run 5/5 ---
Run 5: -182.0 ± 92.6

--- CORRECTED EVALUATION SUMMARY ---
Overall mean: -193.43
Run-to-run std: 38.01
95% CI: [-240.63, -146.24]
--------------------------------------------------

CURRENT_BASELINE CORRECTED EVALUATION COMPLETED
======================================================================

CORRECTED EVALUATION 4/5: LARGE
Training Memory: 100,000 | Min: 2,000
----------------------------------------------------------------------

Corrected Evaluation: 21act_replay_large
Memory Size: 100,000 | Min Memory: 2,000
 Loaded weights from 21act_replay_large_weights.h5
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -134.2 ± 82.3
--- Run 2/5 ---
Run 2: -133.4 ± 71.5
--- Run 3/5 ---
Run 3: -184.6 ± 84.4
--- Run 4/5 ---
Run 4: -127.2 ± 80.4
--- Run 5/5 ---
Run 5: -121.7 ± 86.0

--- CORRECTED EVALUATION SUMMARY ---
Overall mean: -140.21
Run-to-run std: 22.64
95% CI: [-168.33, -112.10]
--------------------------------------------------

LARGE CORRECTED EVALUATION COMPLETED
======================================================================

CORRECTED EVALUATION 5/5: HIGH_MIN_THRESHOLD
Training Memory: 50,000 | Min: 5,000
----------------------------------------------------------------------

Corrected Evaluation: 21act_replay_high_min_threshold
Memory Size: 50,000 | Min Memory: 5,000
 Loaded weights from 21act_replay_high_min_threshold_weights.h5
Running 5 evaluation sessions of 20 episodes each
--- Run 1/5 ---
Run 1: -138.4 ± 85.8
--- Run 2/5 ---
Run 2: -151.4 ± 106.9
--- Run 3/5 ---
Run 3: -214.0 ± 119.4
--- Run 4/5 ---
Run 4: -190.9 ± 92.2
--- Run 5/5 ---
Run 5: -122.8 ± 86.5

--- CORRECTED EVALUATION SUMMARY ---
Overall mean: -163.48
Run-to-run std: 33.86
95% CI: [-205.52, -121.44]
--------------------------------------------------

HIGH_MIN_THRESHOLD CORRECTED EVALUATION COMPLETED
======================================================================

================================================================================
ORIGINAL vs CORRECTED EVALUATION COMPARISON
================================================================================

Config               Original     Corrected    Difference   Impact         
---------------------------------------------------------------------------
Small                -172.9       -186.6       -13.7        SIGNIFICANT    
Medium_Low_Min       -166.3       -170.7       -4.4         MINOR          
Current_Baseline     -176.2       -193.4       -17.2        SIGNIFICANT    
Large                -151.9       -140.2       +11.7        SIGNIFICANT    
High_Min_Threshold   -162.8       -163.5       -0.7         MINOR          

IMPACT ANALYSIS:
----------------------------------------
Average absolute difference: 9.5 points
Maximum absolute difference: 17.2 points
 CONCLUSION: Memory size has SIGNIFICANT impact
   Original results should be discarded

UPDATED PERFORMANCE RANKING:
----------------------------------------
1. Large               : -140.2
2. High_Min_Threshold  : -163.5
3. Medium_Low_Min      : -170.7
4. Small               : -186.6
5. Current_Baseline    : -193.4

Corrected results saved to 'replay_memory_corrected_evaluation.json'

Observations and Analysis ¶

Configuration Memory Size Min Memory Mean Reward 95% CI Relative Rank
Large 100,000 2,000 -140.21 [-168.33, -112.10] First
High Min Threshold 50,000 5,000 -163.48 [-205.52, -121.44] Second
Medium Low Min 25,000 1,000 -170.73 [-198.41, -143.04] Third
Small 10,000 500 -186.59 [-219.50, -153.67] Fourth
Current Baseline 50,000 1,000 -193.43 [-240.63, -146.24] Fifth

1. Replay size clearly impacts evaluation performance

  • The corrected results differ significantly from the originals.
  • Max difference: 17.2 points (Baseline config)
  • Conclusion: Evaluation is sensitive to the memory size used, so mismatch = misleading results.

2. Larger memory leads to better generalization

  • Large (100k) performed best.
  • Why? Likely because:
    • Larger buffer increases experience diversity.
    • Reduces overfitting to recent experiences.
    • More stable Q-value updates.

3. Baseline config surprisingly performed worst

  • Despite being original default (50k, 1k), it gave the lowest performance.
  • Possible reasons:
    • Too small min_memory (1k) might’ve allowed early training on a narrow slice of experiences.
    • May have started training before enough diverse samples were gathered.
  • This shows that my early training matters and a bad start = derail learning

4. High min threshold helped slightly

  • Same memory size (50k) as baseline, but a higher min_memory (5k).
  • Improved performance significantly over the baseline.
  • Suggests that delaying training until buffer is more filled gives better long-term results.

5. Small memory is fast but weak

  • 10k memory with 500 min = second worst performance.
  • Likely learned too fast from too little and too correlated data.
  • Still, it's more efficient per minute (as seen in original graphs), so could be useful in time-sensitive scenarios.

Final decision

  • We will be using 100 000 memory size with 2000 min memory

Hyperparameter tuning¶

We will be running 9 configurations in total

  1. Baseline – our control setup using previously optimized parameters.

  2. High learning rate – tests if faster weight updates accelerate learning.

  3. Low learning rate – tests if slower updates improve stability.

  4. Large batch size – tests if more stable gradients improve performance.

  5. Small batch size – tests if more frequent updates help learning speed.

  6. High gamma – favors long-term reward optimization.

  7. Low gamma – favors short-term reward optimization.

  8. Frequent target updates – checks if updating the target network more often improves convergence.

  9. Rare target updates – checks if less frequent updates give more stable targets.

The purpose of these 9 is to vary one parameter at a time from the baseline so we can isolate the effect of each change. This helps identify which parameters have the largest influence without the results being confounded by multiple simultaneous changes.

Different from the usual methods like GridSearchCV which I initially thought I should be doing

  • Grid search tries all parameter combinations over a dataset that doesn’t change.

  • In reinforcement learning, the “data” changes during training, so results can vary widely.

  • Our method uses controlled experiments and robust multi-run evaluation to handle RL’s randomness

Why we did not run much larger set of configurations?

  • Computational cost – each run involves 600 episodes of training plus multiple evaluation runs, taking hours per configuration.

  • Clarity – running too many configs at once makes it harder to trace performance changes to a specific parameter.

  • Focus – by testing only impactful parameters (learning rate, batch size, gamma, target update frequency) while keeping already-optimized settings fixed (n_actions, memory size, etc.), we maximize the insight per run.

In [48]:
def train_hyperparameter_experiment(n_actions, hyperparam_config, experiment_prefix):
    """Train with different hyperparameter configurations using AdvancedDQNAgent"""
    
    ENV_NAME = 'Pendulum-v0'
    INPUT_SHAPE = 3
    
    # FIXED OPTIMIZED SETTINGS (from previous research)
    REPLAY_MEMORY_SIZE = 100000      # optimal memory size
    MIN_REPLAY_MEMORY = 2000         # optimal min memory
    MAX_EPISODES = 600               # optimal episode count
    MAX_STEPS = 200
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"  # Use your optimized strategy
    
    # HYPERPARAMETERS TO TUNE (from config)
    LEARNING_RATE = hyperparam_config["learning_rate"]
    BATCH_SIZE = hyperparam_config["batch_size"]
    GAMMA = hyperparam_config["gamma"]
    TARGET_UPDATE_EVERY = hyperparam_config["target_update_every"]

    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"

    print("=" * 70)
    print(f"Hyperparameter Experiment: {hyperparam_config['name'].upper()}")
    print(f"LR: {LEARNING_RATE} | Batch: {BATCH_SIZE} | Gamma: {GAMMA} | Target Update: {TARGET_UPDATE_EVERY}")
    print(f"Using PLATEAU RESTART epsilon strategy with AdvancedDQNAgent")
    print("=" * 70)
    print()

    env = gym.make(ENV_NAME)
    
    # USE AdvancedDQNAgent WITH PLATEAU RESTART
    agent = AdvancedDQNAgent(
        INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    print("Model Summary:")
    agent.summary()
    print()
    
    scores = []
    best_avg_reward = -np.inf
    episode_times = []
    epsilon_history = []
    training_steps = 0
    best_episode = 0
    
    start = time.time()

    for ep in range(1, MAX_EPISODES + 1):
        ep_start = time.time()
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]
        
        # Ensure proper state shape
        s = np.array(s, dtype=np.float32)
        if s.shape != (3,):
            s = s.flatten()[:3]
            
        total_reward = 0
        episode_training_steps = 0

        for t in range(MAX_STEPS):
            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, n_actions)
            
            s_next, r, done, info = env.step([torque])
            s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            
            # Ensure proper next state shape
            s_next = np.array(s_next, dtype=np.float32)
            if s_next.shape != (3,):
                s_next = s_next.flatten()[:3]
            
            agent.remember(s, a_idx, r, s_next, done)
            
            # Train only if we have enough experiences
            if len(agent.memory) >= MIN_REPLAY_MEMORY:
                agent.train_step()
                training_steps += 1
                episode_training_steps += 1
            
            s = s_next
            total_reward += r
            if done:
                break

        scores.append(total_reward)
        
        # ADVANCED EPSILON DECAY (uses AdvancedDQNAgent logic)
        recent_performance = np.mean(scores[-10:]) if len(scores) >= 10 else total_reward
        agent.decay_epsilon_advanced(ep, recent_performance)
        epsilon_history.append(agent.epsilon)
        
        if ep % TARGET_UPDATE_EVERY == 0:
            agent.update_target()

        # Save checkpoints
        if ep % 150 == 0:
            agent.save(f"{experiment_prefix}_{ep}_weights.h5")
        
        avg_reward = np.mean(scores[-10:])
        ep_time = time.time() - ep_start
        episode_times.append(ep_time)
        
        # Track best performance
        if avg_reward > best_avg_reward:
            best_avg_reward = avg_reward
            best_episode = ep
            agent.save(SAVE_WEIGHTS_PATH)
        
        # Progress reporting
        if ep <= 10 or ep % 50 == 0 or ep in [100, 200, 300, 400, 500, 600]:
            memory_pct = (len(agent.memory) / REPLAY_MEMORY_SIZE) * 100
            episodes_since_improvement = ep - agent.last_improvement_episode
            print(f"Episode {ep} | Reward: {total_reward:.2f} | Avg(10): {avg_reward:.2f} | "
                  f"ε: {agent.epsilon:.3f} | Memory: {len(agent.memory):,} ({memory_pct:.1f}%) | "
                  f"Steps: {episode_training_steps} | Time: {ep_time:.2f}s | "
                  f"Since Improv: {episodes_since_improvement}")

    env.close()
    total_time = time.time() - start
    avg_time_per_episode = total_time / MAX_EPISODES

    print()
    print("TRAINING COMPLETED")
    print(f"Episodes trained: {MAX_EPISODES}")
    print(f"Best episode: {best_episode}")
    print(f"Best average reward: {best_avg_reward:.2f}")
    print(f"Final epsilon: {agent.epsilon:.4f}")
    print(f"Total training steps: {training_steps:,}")
    print(f"Training time: {total_time:.2f}s ({avg_time_per_episode:.2f}s/ep)")
    print()

    # ROBUST EVALUATION with matching hyperparameters
    print("Evaluating trained model...")
    eval_results = evaluate_hyperparameter_robust(experiment_prefix, n_actions, hyperparam_config, 
                                                 num_episodes=20, num_runs=5)
    
    return {
        'config_name': hyperparam_config['name'],
        'hyperparameters': {
            'learning_rate': LEARNING_RATE,
            'batch_size': BATCH_SIZE,
            'gamma': GAMMA,
            'target_update_every': TARGET_UPDATE_EVERY
        },
        'epsilon_strategy': 'plateau_restart',
        'episodes_trained': MAX_EPISODES,
        'best_episode': best_episode,
        'best_training_reward': best_avg_reward,
        'eval_results': eval_results,
        'training_time': total_time,
        'time_per_episode': avg_time_per_episode,
        'total_training_steps': training_steps,
        'scores_history': scores,
        'epsilon_history': epsilon_history
    }

def action_index_to_torque(action_index, n_actions):
    """Convert action index to torque value"""
    return -2.0 + (action_index * 4.0) / (n_actions - 1)
In [49]:
def evaluate_hyperparameter_robust(experiment_prefix, n_actions, hyperparam_config, num_episodes=20, num_runs=5):
    """Evaluation with EXACT training configuration for consistency"""
    
    INPUT_SHAPE = 3
    MAX_STEPS = 200
    
    # USE EXACT SAME CONFIGURATION AS TRAINING
    REPLAY_MEMORY_SIZE = 100000  # MUST match training
    MIN_REPLAY_MEMORY = 2000     # MUST match training
    BATCH_SIZE = hyperparam_config["batch_size"]  # MUST match training
    TARGET_UPDATE_EVERY = hyperparam_config["target_update_every"]  # MUST match training
    LEARNING_RATE = hyperparam_config["learning_rate"]  # MUST match training
    GAMMA = hyperparam_config["gamma"]  # MUST match training
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"  # MUST match training
    
    SAVE_WEIGHTS_PATH = f"{experiment_prefix}_weights.h5"
    
    print(f"\n Evaluating: {experiment_prefix}")
    print(f"Using EXACT training config: Memory={REPLAY_MEMORY_SIZE:,}, Batch={BATCH_SIZE}, LR={LEARNING_RATE}")
    
    # Create agent with IDENTICAL configuration (USE AdvancedDQNAgent)
    agent = AdvancedDQNAgent(
        INPUT_SHAPE, n_actions, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    try:
        agent.load(SAVE_WEIGHTS_PATH)
        agent.epsilon = 0.0  # Force pure exploitation
        print(f" Loaded weights from {SAVE_WEIGHTS_PATH}")
    except FileNotFoundError:
        print(f" Weights file {SAVE_WEIGHTS_PATH} not found")
        return None
    
    print(f"Running {num_runs} runs × {num_episodes} episodes (epsilon=0.0)")
    
    all_run_results = []
    
    for run in range(num_runs):
        print(f"--- Run {run+1}/{num_runs} ---")
        env = gym.make('Pendulum-v0')
        run_rewards = []
        
        for ep in range(num_episodes):
            # Handle state properly for Pendulum-v0
            state = env.reset()
            if isinstance(state, tuple):
                state = state[0]
            state = np.array(state, dtype=np.float32)
            if state.shape != (3,):
                state = state.flatten()[:3]
            
            total_reward = 0
            
            for t in range(MAX_STEPS):
                a_idx = agent.select_action(state)
                torque = action_index_to_torque(a_idx, n_actions)
                
                # Pendulum-v0 expects single float, not array
                next_state, reward, done, info = env.step([torque])
                
                if isinstance(next_state, tuple):
                    next_state = next_state[0]
                next_state = np.array(next_state, dtype=np.float32)
                if next_state.shape != (3,):
                    next_state = next_state.flatten()[:3]
                
                total_reward += reward
                state = next_state
                
                if done:
                    break
            
            run_rewards.append(total_reward)
        
        env.close()
        
        run_mean = np.mean(run_rewards)
        run_std = np.std(run_rewards)
        all_run_results.append({
            'mean': run_mean,
            'std': run_std,
            'rewards': run_rewards
        })
        
        print(f"Run {run+1}: {run_mean:.1f} ± {run_std:.1f}")
    
    # Calculate overall statistics
    all_means = [run['mean'] for run in all_run_results]
    overall_mean = np.mean(all_means)
    overall_std = np.std(all_means)
    
    # Confidence interval
    confidence_level = 0.95
    dof = len(all_means) - 1
    if dof > 0:
        t_critical = stats.t.ppf((1 + confidence_level) / 2, dof)
        margin_of_error = t_critical * (overall_std / np.sqrt(len(all_means)))
        ci_lower = overall_mean - margin_of_error
        ci_upper = overall_mean + margin_of_error
    else:
        ci_lower = ci_upper = overall_mean
    
    print(f"\n EVALUATION SUMMARY:")
    print(f"Overall mean: {overall_mean:.2f}")
    print(f"Run-to-run std: {overall_std:.2f}")
    print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
    print("-" * 50)
    
    return {
        'overall_mean': overall_mean,
        'overall_std': overall_std,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'run_means': all_means,
        'num_runs': num_runs,
        'num_episodes': num_episodes,
        'evaluation_config': {
            'memory_size': REPLAY_MEMORY_SIZE,
            'min_memory': MIN_REPLAY_MEMORY,
            'batch_size': BATCH_SIZE,
            'learning_rate': LEARNING_RATE,
            'gamma': GAMMA,
            'target_update_every': TARGET_UPDATE_EVERY,
            'epsilon_strategy': EPSILON_STRATEGY
        }
    }
In [50]:
def run_hyperparameter_exploration():
    """Run systematic hyperparameter exploration"""
    
    # Set seeds for reproducibility
    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    tf.random.set_seed(SEED)
    
    # Hyperparameter configurations to test
    hyperparam_configs = [
        {
            "name": "baseline",
            "learning_rate": 3e-4,
            "batch_size": 64,
            "gamma": 0.99,
            "target_update_every": 5,
            "description": "Current baseline configuration"
        },
        {
            "name": "high_lr",
            "learning_rate": 1e-3,
            "batch_size": 64,
            "gamma": 0.99,
            "target_update_every": 5,
            "description": "Higher learning rate for faster learning"
        },
        {
            "name": "low_lr",
            "learning_rate": 1e-4,
            "batch_size": 64,
            "gamma": 0.99,
            "target_update_every": 5,
            "description": "Lower learning rate for stable learning"
        },
        {
            "name": "large_batch",
            "learning_rate": 3e-4,
            "batch_size": 128,
            "gamma": 0.99,
            "target_update_every": 5,
            "description": "Larger batch size for stable gradients"
        },
        {
            "name": "small_batch",
            "learning_rate": 3e-4,
            "batch_size": 32,
            "gamma": 0.99,
            "target_update_every": 5,
            "description": "Smaller batch size for frequent updates"
        },
        {
            "name": "high_gamma",
            "learning_rate": 3e-4,
            "batch_size": 64,
            "gamma": 0.995,
            "target_update_every": 5,
            "description": "Higher gamma for long-term rewards"
        },
        {
            "name": "low_gamma",
            "learning_rate": 3e-4,
            "batch_size": 64,
            "gamma": 0.95,
            "target_update_every": 5,
            "description": "Lower gamma for immediate rewards"
        },
        {
            "name": "frequent_target_update",
            "learning_rate": 3e-4,
            "batch_size": 64,
            "gamma": 0.99,
            "target_update_every": 10,
            "description": "Less frequent target updates for stability"
        },
        {
            "name": "rare_target_update",
            "learning_rate": 3e-4,
            "batch_size": 64,
            "gamma": 0.99,
            "target_update_every": 20,
            "description": "Rare target updates for consistency"
        }
    ]
    
    results = {}
    n_actions = 21  #optimized action space
    
    print("HYPERPARAMETER EXPLORATION")
    print("21 Actions | 600 Episodes | 100k Memory | 2k Min Memory | Plateau Restart")
    print("=" * 80)
    print()
    
    for i, config in enumerate(hyperparam_configs, 1):
        experiment_prefix = f"21act_hyperparam_{config['name']}"
        
        print(f"EXPERIMENT {i}/{len(hyperparam_configs)}: {config['name'].upper()}")
        print(f"Description: {config['description']}")
        print("-" * 70)
        
        results[config['name']] = train_hyperparameter_experiment(
            n_actions=n_actions,
            hyperparam_config=config,
            experiment_prefix=experiment_prefix
        )
        
        print(f"\n{config['name'].upper()} EXPERIMENT COMPLETED")
        print("=" * 70)
        print()
    
    # Analysis and comparison
    create_hyperparameter_analysis(results, hyperparam_configs)
    
    return results
In [51]:
def create_hyperparameter_analysis(results, hyperparam_configs):
    """Create comprehensive analysis of hyperparameter results"""
    
    print("\n" + "=" * 80)
    print("HYPERPARAMETER EXPLORATION RESULTS")
    print("=" * 80)
    
    # Extract results for comparison
    config_names = []
    train_rewards = []
    eval_means = []
    eval_stds = []
    eval_cis = []
    
    for config in hyperparam_configs:
        name = config['name']
        if name in results and results[name]['eval_results'] is not None:
            config_names.append(name)
            train_rewards.append(results[name]['best_training_reward'])
            eval_means.append(results[name]['eval_results']['overall_mean'])
            eval_stds.append(results[name]['eval_results']['overall_std'])
            eval_cis.append((results[name]['eval_results']['ci_lower'], 
                           results[name]['eval_results']['ci_upper']))
    
    # Create comparison table
    print("\nPERFORMANCE COMPARISON:")
    print("-" * 80)
    print(f"{'Config':<20} {'Training':<12} {'Eval Mean':<12} {'Eval Std':<12} {'95% CI':<20}")
    print("-" * 80)
    
    for i, name in enumerate(config_names):
        ci_str = f"[{eval_cis[i][0]:.1f}, {eval_cis[i][1]:.1f}]"
        print(f"{name:<20} {train_rewards[i]:<12.1f} {eval_means[i]:<12.1f} "
              f"{eval_stds[i]:<12.1f} {ci_str:<20}")
    
    # Find best configuration
    if eval_means:
        best_idx = np.argmax(eval_means)
        best_config = config_names[best_idx]
        best_score = eval_means[best_idx]
        
        print(f"\nBEST CONFIGURATION: {best_config.upper()}")
        print(f"Evaluation Score: {best_score:.2f}")
        print(f"95% CI: [{eval_cis[best_idx][0]:.2f}, {eval_cis[best_idx][1]:.2f}]")
        
        # Show hyperparameter values for best config
        best_result = results[best_config]
        print(f"\nBest Hyperparameters:")
        for param, value in best_result['hyperparameters'].items():
            print(f"  {param}: {value}")
    
    print("\n" + "=" * 80)
    
    return {
        'config_names': config_names,
        'eval_means': eval_means,
        'eval_stds': eval_stds,
        'best_config': best_config if eval_means else None,
        'best_score': best_score if eval_means else None
    }
In [52]:
if __name__ == "__main__":
    # Run hyperparameter exploration
    results = run_hyperparameter_exploration()
HYPERPARAMETER EXPLORATION
21 Actions | 600 Episodes | 100k Memory | 2k Min Memory | Plateau Restart
================================================================================

EXPERIMENT 1/9: BASELINE
Description: Current baseline configuration
----------------------------------------------------------------------
======================================================================
Hyperparameter Experiment: BASELINE
LR: 0.0003 | Batch: 64 | Gamma: 0.99 | Target Update: 5
Using PLATEAU RESTART epsilon strategy with AdvancedDQNAgent
======================================================================

Model Summary:

Model Summary:
Model: "dqn_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_12 (Dense)            multiple                  256       
                                                                 
 dense_13 (Dense)            multiple                  4160      
                                                                 
 dense_14 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -1476.23 | Avg(10): -1476.23 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.04s | Since Improv: 1
Episode 2 | Reward: -972.80 | Avg(10): -1224.51 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.04s | Since Improv: 2
Episode 3 | Reward: -1683.06 | Avg(10): -1377.36 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.03s | Since Improv: 3
Episode 4 | Reward: -1191.69 | Avg(10): -1330.95 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.05s | Since Improv: 4
Episode 5 | Reward: -1107.46 | Avg(10): -1286.25 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.05s | Since Improv: 5
Episode 6 | Reward: -960.85 | Avg(10): -1232.02 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.05s | Since Improv: 6
Episode 7 | Reward: -1499.86 | Avg(10): -1270.28 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.05s | Since Improv: 7
Episode 8 | Reward: -770.75 | Avg(10): -1207.84 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.04s | Since Improv: 8
Episode 9 | Reward: -1087.73 | Avg(10): -1194.49 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.07s | Since Improv: 9
Episode 10 | Reward: -1314.11 | Avg(10): -1206.45 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.17s | Since Improv: 10
Epsilon restart at episode 42: 0.814 → 0.300
Episode 50 | Reward: -888.02 | Avg(10): -1316.85 | ε: 0.288 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 7.84s | Since Improv: 0
Episode 100 | Reward: -967.06 | Avg(10): -916.19 | ε: 0.224 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 7.26s | Since Improv: 0
Episode 150 | Reward: -701.20 | Avg(10): -414.45 | ε: 0.175 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 8.07s | Since Improv: 0
Episode 200 | Reward: -2.01 | Avg(10): -169.95 | ε: 0.136 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 7.47s | Since Improv: 4
Episode 250 | Reward: -124.73 | Avg(10): -195.20 | ε: 0.106 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 7.89s | Since Improv: 7
Epsilon restart at episode 289: 0.087 → 0.300
Episode 300 | Reward: -126.00 | Avg(10): -246.21 | ε: 0.284 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 7.69s | Since Improv: 11
Epsilon restart at episode 309: 0.273 → 0.300
Episode 350 | Reward: -126.51 | Avg(10): -171.48 | ε: 0.244 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 5.35s | Since Improv: 0
Epsilon restart at episode 386: 0.205 → 0.300
Episode 400 | Reward: -242.12 | Avg(10): -244.09 | ε: 0.280 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 5.20s | Since Improv: 2
Episode 450 | Reward: -486.23 | Avg(10): -238.41 | ε: 0.218 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 5.63s | Since Improv: 0
Episode 500 | Reward: -123.21 | Avg(10): -169.81 | ε: 0.169 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.35s | Since Improv: 11
Episode 550 | Reward: -331.08 | Avg(10): -249.54 | ε: 0.132 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.40s | Since Improv: 6
Epsilon restart at episode 594: 0.106 → 0.300
Episode 600 | Reward: -121.21 | Avg(10): -168.02 | ε: 0.291 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.44s | Since Improv: 6

TRAINING COMPLETED
Episodes trained: 600
Best episode: 354
Best average reward: -84.94
Final epsilon: 0.2911
Total training steps: 118,001
Training time: 3990.59s (6.65s/ep)

Evaluating trained model...

 Evaluating: 21act_hyperparam_baseline
Using EXACT training config: Memory=100,000, Batch=64, LR=0.0003
 Loaded weights from 21act_hyperparam_baseline_weights.h5
Running 5 runs × 20 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -140.0 ± 96.4
--- Run 2/5 ---
Run 2: -199.8 ± 93.1
--- Run 3/5 ---
Run 3: -107.9 ± 79.6
--- Run 4/5 ---
Run 4: -159.2 ± 97.2
--- Run 5/5 ---
Run 5: -120.1 ± 99.5

 EVALUATION SUMMARY:
Overall mean: -145.38
Run-to-run std: 32.32
95% CI: [-185.51, -105.25]
--------------------------------------------------

BASELINE EXPERIMENT COMPLETED
======================================================================

EXPERIMENT 2/9: HIGH_LR
Description: Higher learning rate for faster learning
----------------------------------------------------------------------
======================================================================
Hyperparameter Experiment: HIGH_LR
LR: 0.001 | Batch: 64 | Gamma: 0.99 | Target Update: 5
Using PLATEAU RESTART epsilon strategy with AdvancedDQNAgent
======================================================================

Model Summary:

Model Summary:
Model: "dqn_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_24 (Dense)            multiple                  256       
                                                                 
 dense_25 (Dense)            multiple                  4160      
                                                                 
 dense_26 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -1323.13 | Avg(10): -1323.13 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.02s | Since Improv: 1
Episode 2 | Reward: -1696.77 | Avg(10): -1509.95 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.02s | Since Improv: 2
Episode 3 | Reward: -770.19 | Avg(10): -1263.36 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.02s | Since Improv: 3
Episode 4 | Reward: -1548.74 | Avg(10): -1334.71 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.03s | Since Improv: 4
Episode 5 | Reward: -1438.34 | Avg(10): -1355.44 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.02s | Since Improv: 5
Episode 6 | Reward: -1607.40 | Avg(10): -1397.43 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.03s | Since Improv: 6
Episode 7 | Reward: -1333.37 | Avg(10): -1388.28 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.03s | Since Improv: 7
Episode 8 | Reward: -967.29 | Avg(10): -1335.66 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.04s | Since Improv: 8
Episode 9 | Reward: -1625.00 | Avg(10): -1367.81 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.02s | Since Improv: 9
Episode 10 | Reward: -922.86 | Avg(10): -1323.31 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.09s | Since Improv: 10
Episode 50 | Reward: -1089.60 | Avg(10): -1068.86 | ε: 0.778 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 5.12s | Since Improv: 0
Episode 100 | Reward: -619.69 | Avg(10): -772.01 | ε: 0.606 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 5.07s | Since Improv: 0
Episode 150 | Reward: -419.99 | Avg(10): -564.80 | ε: 0.471 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 5.34s | Since Improv: 0
Episode 200 | Reward: -372.95 | Avg(10): -373.43 | ε: 0.367 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 6.75s | Since Improv: 0
Episode 250 | Reward: -485.90 | Avg(10): -471.83 | ε: 0.286 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 5.37s | Since Improv: 2
Episode 300 | Reward: -369.78 | Avg(10): -420.16 | ε: 0.222 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 5.55s | Since Improv: 18
Epsilon restart at episode 302: 0.221 → 0.300
Epsilon restart at episode 326: 0.267 → 0.300
Episode 350 | Reward: -488.72 | Avg(10): -436.61 | ε: 0.266 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 5.32s | Since Improv: 0
Episode 400 | Reward: -492.57 | Avg(10): -470.30 | ε: 0.207 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 5.40s | Since Improv: 1
Epsilon restart at episode 419: 0.189 → 0.300
Episode 450 | Reward: -433.32 | Avg(10): -483.66 | ε: 0.257 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 5.34s | Since Improv: 3
Episode 500 | Reward: -699.01 | Avg(10): -468.42 | ε: 0.200 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.45s | Since Improv: 0
Episode 550 | Reward: -383.22 | Avg(10): -401.95 | ε: 0.156 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.41s | Since Improv: 0
Episode 600 | Reward: -251.86 | Avg(10): -336.56 | ε: 0.121 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 12.35s | Since Improv: 2

TRAINING COMPLETED
Episodes trained: 600
Best episode: 274
Best average reward: -187.73
Final epsilon: 0.1211
Total training steps: 118,001
Training time: 3252.67s (5.42s/ep)

Evaluating trained model...

 Evaluating: 21act_hyperparam_high_lr
Using EXACT training config: Memory=100,000, Batch=64, LR=0.001
 Loaded weights from 21act_hyperparam_high_lr_weights.h5
Running 5 runs × 20 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -164.8 ± 67.6
--- Run 2/5 ---
Run 2: -189.1 ± 93.3
--- Run 3/5 ---
Run 3: -174.0 ± 82.9
--- Run 4/5 ---
Run 4: -159.0 ± 99.9
--- Run 5/5 ---
Run 5: -172.9 ± 87.3

 EVALUATION SUMMARY:
Overall mean: -171.95
Run-to-run std: 10.16
95% CI: [-184.57, -159.34]
--------------------------------------------------

HIGH_LR EXPERIMENT COMPLETED
======================================================================

EXPERIMENT 3/9: LOW_LR
Description: Lower learning rate for stable learning
----------------------------------------------------------------------
======================================================================
Hyperparameter Experiment: LOW_LR
LR: 0.0001 | Batch: 64 | Gamma: 0.99 | Target Update: 5
Using PLATEAU RESTART epsilon strategy with AdvancedDQNAgent
======================================================================

Model Summary:

Model Summary:
Model: "dqn_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_36 (Dense)            multiple                  256       
                                                                 
 dense_37 (Dense)            multiple                  4160      
                                                                 
 dense_38 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -1338.09 | Avg(10): -1338.09 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.02s | Since Improv: 1
Episode 2 | Reward: -1754.64 | Avg(10): -1546.36 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.02s | Since Improv: 2
Episode 3 | Reward: -1436.45 | Avg(10): -1509.72 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.02s | Since Improv: 3
Episode 4 | Reward: -1615.66 | Avg(10): -1536.21 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.02s | Since Improv: 4
Episode 5 | Reward: -1078.00 | Avg(10): -1444.57 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.03s | Since Improv: 5
Episode 6 | Reward: -1388.39 | Avg(10): -1435.20 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.03s | Since Improv: 6
Episode 7 | Reward: -964.05 | Avg(10): -1367.90 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.03s | Since Improv: 7
Episode 8 | Reward: -968.78 | Avg(10): -1318.01 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.03s | Since Improv: 8
Episode 9 | Reward: -904.91 | Avg(10): -1272.11 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.04s | Since Improv: 9
Episode 10 | Reward: -1807.47 | Avg(10): -1325.64 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.07s | Since Improv: 10
Epsilon restart at episode 20: 0.909 → 0.300
Epsilon restart at episode 40: 0.273 → 0.300
Episode 50 | Reward: -1513.93 | Avg(10): -1608.25 | ε: 0.285 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 6.12s | Since Improv: 10
Episode 100 | Reward: -1241.72 | Avg(10): -1219.77 | ε: 0.222 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 6.72s | Since Improv: 0
Episode 150 | Reward: -1320.97 | Avg(10): -1204.21 | ε: 0.173 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 6.58s | Since Improv: 0
Episode 200 | Reward: -900.40 | Avg(10): -1056.34 | ε: 0.135 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 6.29s | Since Improv: 0
Epsilon restart at episode 246: 0.107 → 0.300
Episode 250 | Reward: -918.97 | Avg(10): -1005.90 | ε: 0.294 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 7.48s | Since Improv: 0
Episode 300 | Reward: -753.73 | Avg(10): -822.17 | ε: 0.229 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 6.43s | Since Improv: 3
Episode 350 | Reward: -751.45 | Avg(10): -723.61 | ε: 0.178 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 6.21s | Since Improv: 5
Episode 400 | Reward: -774.81 | Avg(10): -616.72 | ε: 0.139 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 6.20s | Since Improv: 0
Episode 450 | Reward: -616.40 | Avg(10): -615.05 | ε: 0.108 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 6.60s | Since Improv: 1
Episode 500 | Reward: -366.99 | Avg(10): -517.63 | ε: 0.084 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.95s | Since Improv: 4
Episode 550 | Reward: -128.47 | Avg(10): -246.12 | ε: 0.065 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.32s | Since Improv: 1
Episode 600 | Reward: -240.18 | Avg(10): -159.82 | ε: 0.051 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.39s | Since Improv: 2

TRAINING COMPLETED
Episodes trained: 600
Best episode: 563
Best average reward: -124.51
Final epsilon: 0.0509
Total training steps: 118,001
Training time: 3952.50s (6.59s/ep)

Evaluating trained model...

 Evaluating: 21act_hyperparam_low_lr
Using EXACT training config: Memory=100,000, Batch=64, LR=0.0001
 Loaded weights from 21act_hyperparam_low_lr_weights.h5
Running 5 runs × 20 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -130.3 ± 60.2
--- Run 2/5 ---
Run 2: -178.8 ± 80.5
--- Run 3/5 ---
Run 3: -154.1 ± 74.5
--- Run 4/5 ---
Run 4: -162.5 ± 87.9
--- Run 5/5 ---
Run 5: -184.9 ± 97.9

 EVALUATION SUMMARY:
Overall mean: -162.14
Run-to-run std: 19.34
95% CI: [-186.15, -138.12]
--------------------------------------------------

LOW_LR EXPERIMENT COMPLETED
======================================================================

EXPERIMENT 4/9: LARGE_BATCH
Description: Larger batch size for stable gradients
----------------------------------------------------------------------
======================================================================
Hyperparameter Experiment: LARGE_BATCH
LR: 0.0003 | Batch: 128 | Gamma: 0.99 | Target Update: 5
Using PLATEAU RESTART epsilon strategy with AdvancedDQNAgent
======================================================================

Model Summary:

Model Summary:
Model: "dqn_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_48 (Dense)            multiple                  256       
                                                                 
 dense_49 (Dense)            multiple                  4160      
                                                                 
 dense_50 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -1817.98 | Avg(10): -1817.98 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.01s | Since Improv: 1
Episode 2 | Reward: -1252.98 | Avg(10): -1535.48 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.01s | Since Improv: 2
Episode 3 | Reward: -1423.81 | Avg(10): -1498.26 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.03s | Since Improv: 3
Episode 4 | Reward: -1555.19 | Avg(10): -1512.49 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.03s | Since Improv: 4
Episode 5 | Reward: -987.32 | Avg(10): -1407.46 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.04s | Since Improv: 5
Episode 6 | Reward: -1430.61 | Avg(10): -1411.31 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.03s | Since Improv: 6
Episode 7 | Reward: -1081.69 | Avg(10): -1364.23 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.04s | Since Improv: 7
Episode 8 | Reward: -1495.42 | Avg(10): -1380.62 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.04s | Since Improv: 8
Episode 9 | Reward: -1638.99 | Avg(10): -1409.33 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.03s | Since Improv: 9
Episode 10 | Reward: -860.28 | Avg(10): -1354.43 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.10s | Since Improv: 10
Episode 50 | Reward: -782.91 | Avg(10): -1100.50 | ε: 0.778 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 7.77s | Since Improv: 0
Episode 100 | Reward: -1029.41 | Avg(10): -1119.35 | ε: 0.606 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 6.13s | Since Improv: 3
Episode 150 | Reward: -397.16 | Avg(10): -585.59 | ε: 0.471 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 6.71s | Since Improv: 0
Episode 200 | Reward: -513.98 | Avg(10): -353.58 | ε: 0.367 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 6.52s | Since Improv: 7
Episode 250 | Reward: -248.41 | Avg(10): -315.81 | ε: 0.286 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 7.08s | Since Improv: 3
Episode 300 | Reward: -234.86 | Avg(10): -183.22 | ε: 0.222 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 7.82s | Since Improv: 0
Episode 350 | Reward: -384.77 | Avg(10): -172.59 | ε: 0.173 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 6.51s | Since Improv: 9
Epsilon restart at episode 376: 0.153 → 0.300
Episode 400 | Reward: -120.12 | Avg(10): -218.92 | ε: 0.266 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 7.75s | Since Improv: 0
Episode 450 | Reward: -114.71 | Avg(10): -225.40 | ε: 0.207 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 6.72s | Since Improv: 4
Epsilon restart at episode 466: 0.192 → 0.300
Epsilon restart at episode 496: 0.259 → 0.300
Episode 500 | Reward: -252.75 | Avg(10): -205.56 | ε: 0.294 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.78s | Since Improv: 4
Episode 550 | Reward: -127.53 | Avg(10): -200.07 | ε: 0.229 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.63s | Since Improv: 16
Epsilon restart at episode 554: 0.225 → 0.300
Episode 600 | Reward: -252.94 | Avg(10): -299.10 | ε: 0.238 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.54s | Since Improv: 18

TRAINING COMPLETED
Episodes trained: 600
Best episode: 525
Best average reward: -121.67
Final epsilon: 0.2382
Total training steps: 118,001
Training time: 4003.09s (6.67s/ep)

Evaluating trained model...

 Evaluating: 21act_hyperparam_large_batch
Using EXACT training config: Memory=100,000, Batch=128, LR=0.0003
 Loaded weights from 21act_hyperparam_large_batch_weights.h5
Running 5 runs × 20 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -120.9 ± 89.2
--- Run 2/5 ---
Run 2: -165.9 ± 87.1
--- Run 3/5 ---
Run 3: -152.4 ± 63.1
--- Run 4/5 ---
Run 4: -157.8 ± 92.4
--- Run 5/5 ---
Run 5: -173.0 ± 77.9

 EVALUATION SUMMARY:
Overall mean: -154.00
Run-to-run std: 17.96
95% CI: [-176.31, -131.70]
--------------------------------------------------

LARGE_BATCH EXPERIMENT COMPLETED
======================================================================

EXPERIMENT 5/9: SMALL_BATCH
Description: Smaller batch size for frequent updates
----------------------------------------------------------------------
======================================================================
Hyperparameter Experiment: SMALL_BATCH
LR: 0.0003 | Batch: 32 | Gamma: 0.99 | Target Update: 5
Using PLATEAU RESTART epsilon strategy with AdvancedDQNAgent
======================================================================

Model Summary:

Model Summary:
Model: "dqn_20"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_60 (Dense)            multiple                  256       
                                                                 
 dense_61 (Dense)            multiple                  4160      
                                                                 
 dense_62 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -897.87 | Avg(10): -897.87 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.02s | Since Improv: 1
Episode 2 | Reward: -969.33 | Avg(10): -933.60 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.02s | Since Improv: 2
Episode 3 | Reward: -1320.51 | Avg(10): -1062.57 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.03s | Since Improv: 3
Episode 4 | Reward: -1618.65 | Avg(10): -1201.59 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.03s | Since Improv: 4
Episode 5 | Reward: -1017.29 | Avg(10): -1164.73 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.03s | Since Improv: 5
Episode 6 | Reward: -1562.22 | Avg(10): -1230.98 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.04s | Since Improv: 6
Episode 7 | Reward: -972.48 | Avg(10): -1194.05 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.03s | Since Improv: 7
Episode 8 | Reward: -1064.20 | Avg(10): -1177.82 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.03s | Since Improv: 8
Episode 9 | Reward: -1542.69 | Avg(10): -1218.36 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.05s | Since Improv: 9
Episode 10 | Reward: -1436.57 | Avg(10): -1240.18 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.12s | Since Improv: 10
Epsilon restart at episode 20: 0.909 → 0.300
Episode 50 | Reward: -1455.56 | Avg(10): -1197.85 | ε: 0.258 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 7.13s | Since Improv: 0
Episode 100 | Reward: -1206.13 | Avg(10): -966.80 | ε: 0.201 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 6.11s | Since Improv: 0
Episode 150 | Reward: -386.36 | Avg(10): -244.77 | ε: 0.156 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 6.45s | Since Improv: 0
Episode 200 | Reward: -779.85 | Avg(10): -290.73 | ε: 0.122 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 7.00s | Since Improv: 6
Episode 250 | Reward: -355.01 | Avg(10): -194.66 | ε: 0.095 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 6.52s | Since Improv: 0
Episode 300 | Reward: -123.92 | Avg(10): -178.25 | ε: 0.074 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 6.65s | Since Improv: 3
Episode 350 | Reward: -0.81 | Avg(10): -149.12 | ε: 0.057 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 7.37s | Since Improv: 16
Episode 400 | Reward: -125.95 | Avg(10): -124.33 | ε: 0.050 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 6.46s | Since Improv: 0
Epsilon restart at episode 423: 0.050 → 0.300
Epsilon restart at episode 443: 0.273 → 0.300
Episode 450 | Reward: -393.22 | Avg(10): -258.98 | ε: 0.290 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 6.10s | Since Improv: 7
Epsilon restart at episode 463: 0.273 → 0.300
Episode 500 | Reward: -2.53 | Avg(10): -283.16 | ε: 0.249 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.01s | Since Improv: 13
Episode 550 | Reward: -245.57 | Avg(10): -200.87 | ε: 0.194 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.97s | Since Improv: 5
Episode 600 | Reward: -121.66 | Avg(10): -182.22 | ε: 0.151 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 7.13s | Since Improv: 8

TRAINING COMPLETED
Episodes trained: 600
Best episode: 391
Best average reward: -107.70
Final epsilon: 0.1510
Total training steps: 118,001
Training time: 3856.45s (6.43s/ep)

Evaluating trained model...

 Evaluating: 21act_hyperparam_small_batch
Using EXACT training config: Memory=100,000, Batch=32, LR=0.0003
 Loaded weights from 21act_hyperparam_small_batch_weights.h5
Running 5 runs × 20 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -202.1 ± 400.7
--- Run 2/5 ---
Run 2: -282.0 ± 360.6
--- Run 3/5 ---
Run 3: -317.2 ± 524.0
--- Run 4/5 ---
Run 4: -339.1 ± 527.2
--- Run 5/5 ---
Run 5: -355.3 ± 510.8

 EVALUATION SUMMARY:
Overall mean: -299.16
Run-to-run std: 54.39
95% CI: [-366.69, -231.62]
--------------------------------------------------

SMALL_BATCH EXPERIMENT COMPLETED
======================================================================

EXPERIMENT 6/9: HIGH_GAMMA
Description: Higher gamma for long-term rewards
----------------------------------------------------------------------
======================================================================
Hyperparameter Experiment: HIGH_GAMMA
LR: 0.0003 | Batch: 64 | Gamma: 0.995 | Target Update: 5
Using PLATEAU RESTART epsilon strategy with AdvancedDQNAgent
======================================================================

Model Summary:

Model Summary:
Model: "dqn_24"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_72 (Dense)            multiple                  256       
                                                                 
 dense_73 (Dense)            multiple                  4160      
                                                                 
 dense_74 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -1082.51 | Avg(10): -1082.51 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.02s | Since Improv: 1
Episode 2 | Reward: -1575.52 | Avg(10): -1329.01 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.01s | Since Improv: 2
Episode 3 | Reward: -1295.09 | Avg(10): -1317.71 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.03s | Since Improv: 3
Episode 4 | Reward: -1480.98 | Avg(10): -1358.52 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.02s | Since Improv: 4
Episode 5 | Reward: -1580.57 | Avg(10): -1402.93 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.04s | Since Improv: 5
Episode 6 | Reward: -1708.43 | Avg(10): -1453.85 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.03s | Since Improv: 6
Episode 7 | Reward: -887.13 | Avg(10): -1372.89 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.03s | Since Improv: 7
Episode 8 | Reward: -1176.44 | Avg(10): -1348.33 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.02s | Since Improv: 8
Episode 9 | Reward: -1781.22 | Avg(10): -1396.43 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.02s | Since Improv: 9
Episode 10 | Reward: -1560.95 | Avg(10): -1412.88 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.11s | Since Improv: 10
Episode 50 | Reward: -884.89 | Avg(10): -1177.74 | ε: 0.778 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 6.28s | Since Improv: 0
Episode 100 | Reward: -650.84 | Avg(10): -981.45 | ε: 0.606 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 6.29s | Since Improv: 0
Episode 150 | Reward: -494.69 | Avg(10): -648.51 | ε: 0.471 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 6.46s | Since Improv: 0
Episode 200 | Reward: -838.00 | Avg(10): -294.27 | ε: 0.367 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 6.10s | Since Improv: 0
Episode 250 | Reward: -126.47 | Avg(10): -201.44 | ε: 0.286 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 6.53s | Since Improv: 0
Episode 300 | Reward: -120.50 | Avg(10): -143.98 | ε: 0.222 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 6.53s | Since Improv: 0
Episode 350 | Reward: -2.67 | Avg(10): -230.16 | ε: 0.173 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 6.27s | Since Improv: 8
Episode 400 | Reward: -4.84 | Avg(10): -119.78 | ε: 0.135 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 6.44s | Since Improv: 0
Epsilon restart at episode 441: 0.110 → 0.300
Episode 450 | Reward: -639.31 | Avg(10): -285.99 | ε: 0.287 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 6.42s | Since Improv: 4
Episode 500 | Reward: -15.87 | Avg(10): -266.04 | ε: 0.223 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.71s | Since Improv: 16
Epsilon restart at episode 504: 0.220 → 0.300
Episode 550 | Reward: -235.12 | Avg(10): -295.21 | ε: 0.238 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.55s | Since Improv: 13
Epsilon restart at episode 557: 0.231 → 0.300
Episode 600 | Reward: -126.69 | Avg(10): -147.37 | ε: 0.242 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.53s | Since Improv: 0

TRAINING COMPLETED
Episodes trained: 600
Best episode: 411
Best average reward: -76.38
Final epsilon: 0.2418
Total training steps: 118,001
Training time: 3861.31s (6.44s/ep)

Evaluating trained model...

 Evaluating: 21act_hyperparam_high_gamma
Using EXACT training config: Memory=100,000, Batch=64, LR=0.0003
 Loaded weights from 21act_hyperparam_high_gamma_weights.h5
Running 5 runs × 20 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -149.4 ± 88.6
--- Run 2/5 ---
Run 2: -102.6 ± 77.5
--- Run 3/5 ---
Run 3: -96.3 ± 69.8
--- Run 4/5 ---
Run 4: -175.9 ± 63.9
--- Run 5/5 ---
Run 5: -142.6 ± 95.1

 EVALUATION SUMMARY:
Overall mean: -133.35
Run-to-run std: 29.91
95% CI: [-170.49, -96.21]
--------------------------------------------------

HIGH_GAMMA EXPERIMENT COMPLETED
======================================================================

EXPERIMENT 7/9: LOW_GAMMA
Description: Lower gamma for immediate rewards
----------------------------------------------------------------------
======================================================================
Hyperparameter Experiment: LOW_GAMMA
LR: 0.0003 | Batch: 64 | Gamma: 0.95 | Target Update: 5
Using PLATEAU RESTART epsilon strategy with AdvancedDQNAgent
======================================================================

Model Summary:

Model Summary:
Model: "dqn_28"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_84 (Dense)            multiple                  256       
                                                                 
 dense_85 (Dense)            multiple                  4160      
                                                                 
 dense_86 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -1092.44 | Avg(10): -1092.44 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.03s | Since Improv: 1
Episode 2 | Reward: -1643.69 | Avg(10): -1368.06 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.01s | Since Improv: 2
Episode 3 | Reward: -1069.71 | Avg(10): -1268.61 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.02s | Since Improv: 3
Episode 4 | Reward: -1182.34 | Avg(10): -1247.04 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.02s | Since Improv: 4
Episode 5 | Reward: -819.84 | Avg(10): -1161.60 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.03s | Since Improv: 5
Episode 6 | Reward: -1598.19 | Avg(10): -1234.37 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.03s | Since Improv: 6
Episode 7 | Reward: -1422.51 | Avg(10): -1261.24 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.03s | Since Improv: 7
Episode 8 | Reward: -1617.52 | Avg(10): -1305.78 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.04s | Since Improv: 8
Episode 9 | Reward: -976.51 | Avg(10): -1269.19 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.04s | Since Improv: 9
Episode 10 | Reward: -1394.81 | Avg(10): -1281.75 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.11s | Since Improv: 10
Epsilon restart at episode 20: 0.909 → 0.300
Episode 50 | Reward: -651.62 | Avg(10): -1080.83 | ε: 0.258 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 6.62s | Since Improv: 0
Episode 100 | Reward: -1.95 | Avg(10): -338.17 | ε: 0.201 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 6.55s | Since Improv: 0
Episode 150 | Reward: -247.92 | Avg(10): -274.49 | ε: 0.156 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 6.48s | Since Improv: 4
Epsilon restart at episode 166: 0.145 → 0.300
Episode 200 | Reward: -245.90 | Avg(10): -219.87 | ε: 0.253 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 6.50s | Since Improv: 19
Episode 250 | Reward: -129.18 | Avg(10): -112.91 | ε: 0.197 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 6.89s | Since Improv: 0
Episode 300 | Reward: -3.95 | Avg(10): -173.55 | ε: 0.153 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 6.87s | Since Improv: 0
Episode 350 | Reward: -121.97 | Avg(10): -172.50 | ε: 0.119 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 6.41s | Since Improv: 0
Episode 400 | Reward: -254.99 | Avg(10): -261.12 | ε: 0.093 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 6.35s | Since Improv: 12
Episode 450 | Reward: -116.82 | Avg(10): -159.32 | ε: 0.072 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 6.96s | Since Improv: 0
Epsilon restart at episode 484: 0.061 → 0.300
Episode 500 | Reward: -129.24 | Avg(10): -213.75 | ε: 0.277 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.90s | Since Improv: 0
Episode 550 | Reward: -256.25 | Avg(10): -241.71 | ε: 0.215 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.26s | Since Improv: 10
Episode 600 | Reward: -130.21 | Avg(10): -188.78 | ε: 0.168 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.37s | Since Improv: 5

TRAINING COMPLETED
Episodes trained: 600
Best episode: 249
Best average reward: -112.68
Final epsilon: 0.1677
Total training steps: 118,001
Training time: 3906.60s (6.51s/ep)

Evaluating trained model...

 Evaluating: 21act_hyperparam_low_gamma
Using EXACT training config: Memory=100,000, Batch=64, LR=0.0003
 Loaded weights from 21act_hyperparam_low_gamma_weights.h5
Running 5 runs × 20 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -400.8 ± 556.0
--- Run 2/5 ---
Run 2: -495.6 ± 606.0
--- Run 3/5 ---
Run 3: -320.7 ± 465.4
--- Run 4/5 ---
Run 4: -142.1 ± 98.7
--- Run 5/5 ---
Run 5: -264.1 ± 338.9

 EVALUATION SUMMARY:
Overall mean: -324.67
Run-to-run std: 120.01
95% CI: [-473.68, -175.65]
--------------------------------------------------

LOW_GAMMA EXPERIMENT COMPLETED
======================================================================

EXPERIMENT 8/9: FREQUENT_TARGET_UPDATE
Description: Less frequent target updates for stability
----------------------------------------------------------------------
======================================================================
Hyperparameter Experiment: FREQUENT_TARGET_UPDATE
LR: 0.0003 | Batch: 64 | Gamma: 0.99 | Target Update: 10
Using PLATEAU RESTART epsilon strategy with AdvancedDQNAgent
======================================================================

Model Summary:

Model Summary:
Model: "dqn_32"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_96 (Dense)            multiple                  256       
                                                                 
 dense_97 (Dense)            multiple                  4160      
                                                                 
 dense_98 (Dense)            multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -1145.29 | Avg(10): -1145.29 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.01s | Since Improv: 1
Episode 2 | Reward: -1277.74 | Avg(10): -1211.52 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.02s | Since Improv: 2
Episode 3 | Reward: -990.27 | Avg(10): -1137.77 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.02s | Since Improv: 3
Episode 4 | Reward: -1261.05 | Avg(10): -1168.59 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.03s | Since Improv: 4
Episode 5 | Reward: -982.12 | Avg(10): -1131.29 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.06s | Since Improv: 5
Episode 6 | Reward: -1174.40 | Avg(10): -1138.48 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.03s | Since Improv: 6
Episode 7 | Reward: -1406.70 | Avg(10): -1176.80 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.02s | Since Improv: 7
Episode 8 | Reward: -1451.95 | Avg(10): -1211.19 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.04s | Since Improv: 8
Episode 9 | Reward: -1168.91 | Avg(10): -1206.49 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.03s | Since Improv: 9
Episode 10 | Reward: -1164.96 | Avg(10): -1202.34 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.09s | Since Improv: 10
Episode 50 | Reward: -965.96 | Avg(10): -1299.69 | ε: 0.778 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 6.36s | Since Improv: 19
Epsilon restart at episode 51: 0.778 → 0.300
Episode 100 | Reward: -532.72 | Avg(10): -960.12 | ε: 0.235 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 6.05s | Since Improv: 0
Episode 150 | Reward: -1050.13 | Avg(10): -888.72 | ε: 0.183 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 6.68s | Since Improv: 12
Epsilon restart at episode 158: 0.176 → 0.300
Episode 200 | Reward: -239.07 | Avg(10): -593.14 | ε: 0.243 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 6.22s | Since Improv: 0
Episode 250 | Reward: -1.32 | Avg(10): -337.35 | ε: 0.189 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 6.52s | Since Improv: 0
Episode 300 | Reward: -121.93 | Avg(10): -172.47 | ε: 0.147 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 6.56s | Since Improv: 17
Episode 350 | Reward: -234.21 | Avg(10): -170.49 | ε: 0.115 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 6.15s | Since Improv: 12
Epsilon restart at episode 381: 0.099 → 0.300
Episode 400 | Reward: -242.04 | Avg(10): -366.14 | ε: 0.273 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 5.35s | Since Improv: 19
Epsilon restart at episode 401: 0.273 → 0.300
Episode 450 | Reward: -246.62 | Avg(10): -160.48 | ε: 0.235 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 7.19s | Since Improv: 0
Episode 500 | Reward: -376.46 | Avg(10): -207.55 | ε: 0.183 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 7.30s | Since Improv: 0
Episode 550 | Reward: -358.65 | Avg(10): -241.22 | ε: 0.142 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 7.18s | Since Improv: 2
Episode 600 | Reward: -118.36 | Avg(10): -178.53 | ε: 0.111 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 7.32s | Since Improv: 13

TRAINING COMPLETED
Episodes trained: 600
Best episode: 584
Best average reward: -118.34
Final epsilon: 0.1106
Total training steps: 118,001
Training time: 3965.90s (6.61s/ep)

Evaluating trained model...

 Evaluating: 21act_hyperparam_frequent_target_update
Using EXACT training config: Memory=100,000, Batch=64, LR=0.0003
 Loaded weights from 21act_hyperparam_frequent_target_update_weights.h5
Running 5 runs × 20 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -151.3 ± 63.6
--- Run 2/5 ---
Run 2: -146.6 ± 97.0
--- Run 3/5 ---
Run 3: -147.1 ± 93.0
--- Run 4/5 ---
Run 4: -172.2 ± 91.1
--- Run 5/5 ---
Run 5: -143.2 ± 85.5

 EVALUATION SUMMARY:
Overall mean: -152.08
Run-to-run std: 10.37
95% CI: [-164.95, -139.20]
--------------------------------------------------

FREQUENT_TARGET_UPDATE EXPERIMENT COMPLETED
======================================================================

EXPERIMENT 9/9: RARE_TARGET_UPDATE
Description: Rare target updates for consistency
----------------------------------------------------------------------
======================================================================
Hyperparameter Experiment: RARE_TARGET_UPDATE
LR: 0.0003 | Batch: 64 | Gamma: 0.99 | Target Update: 20
Using PLATEAU RESTART epsilon strategy with AdvancedDQNAgent
======================================================================

Model Summary:

Model Summary:
Model: "dqn_36"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_108 (Dense)           multiple                  256       
                                                                 
 dense_109 (Dense)           multiple                  4160      
                                                                 
 dense_110 (Dense)           multiple                  1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -1061.53 | Avg(10): -1061.53 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.02s | Since Improv: 1
Episode 2 | Reward: -967.95 | Avg(10): -1014.74 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.02s | Since Improv: 2
Episode 3 | Reward: -914.00 | Avg(10): -981.16 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.04s | Since Improv: 3
Episode 4 | Reward: -1220.28 | Avg(10): -1040.94 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.08s | Since Improv: 4
Episode 5 | Reward: -934.96 | Avg(10): -1019.74 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.04s | Since Improv: 5
Episode 6 | Reward: -1084.41 | Avg(10): -1030.52 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.05s | Since Improv: 6
Episode 7 | Reward: -1074.91 | Avg(10): -1036.86 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.05s | Since Improv: 7
Episode 8 | Reward: -1645.75 | Avg(10): -1112.97 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.04s | Since Improv: 8
Episode 9 | Reward: -1357.33 | Avg(10): -1140.12 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.05s | Since Improv: 9
Episode 10 | Reward: -1135.09 | Avg(10): -1139.62 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.12s | Since Improv: 10
Epsilon restart at episode 20: 0.909 → 0.300
Epsilon restart at episode 40: 0.273 → 0.300
Episode 50 | Reward: -1326.26 | Avg(10): -1556.83 | ε: 0.285 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 7.01s | Since Improv: 0
Episode 100 | Reward: -1416.52 | Avg(10): -1351.10 | ε: 0.222 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 7.11s | Since Improv: 0
Episode 150 | Reward: -1067.61 | Avg(10): -987.45 | ε: 0.173 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 7.43s | Since Improv: 0
Episode 200 | Reward: -857.90 | Avg(10): -469.57 | ε: 0.135 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 9.12s | Since Improv: 0
Episode 250 | Reward: -132.08 | Avg(10): -289.10 | ε: 0.105 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 8.78s | Since Improv: 14
Episode 300 | Reward: -264.83 | Avg(10): -333.52 | ε: 0.081 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 9.35s | Since Improv: 6
Epsilon restart at episode 314: 0.076 → 0.300
Episode 350 | Reward: -369.00 | Avg(10): -259.14 | ε: 0.250 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 7.86s | Since Improv: 5
Episode 400 | Reward: -240.68 | Avg(10): -251.82 | ε: 0.195 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 7.47s | Since Improv: 0
Episode 450 | Reward: -251.15 | Avg(10): -571.79 | ε: 0.152 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 8.49s | Since Improv: 4
Episode 500 | Reward: -366.09 | Avg(10): -279.55 | ε: 0.118 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 8.47s | Since Improv: 0
Episode 550 | Reward: -659.72 | Avg(10): -209.44 | ε: 0.092 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 7.18s | Since Improv: 3
Episode 600 | Reward: -238.48 | Avg(10): -218.36 | ε: 0.072 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 7.45s | Since Improv: 2

TRAINING COMPLETED
Episodes trained: 600
Best episode: 560
Best average reward: -112.31
Final epsilon: 0.0715
Total training steps: 118,001
Training time: 4701.63s (7.84s/ep)

Evaluating trained model...

 Evaluating: 21act_hyperparam_rare_target_update
Using EXACT training config: Memory=100,000, Batch=64, LR=0.0003
 Loaded weights from 21act_hyperparam_rare_target_update_weights.h5
Running 5 runs × 20 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -195.0 ± 92.7
--- Run 2/5 ---
Run 2: -171.5 ± 110.9
--- Run 3/5 ---
Run 3: -174.9 ± 93.3
--- Run 4/5 ---
Run 4: -178.3 ± 120.3
--- Run 5/5 ---
Run 5: -144.4 ± 107.7

 EVALUATION SUMMARY:
Overall mean: -172.82
Run-to-run std: 16.35
95% CI: [-193.12, -152.52]
--------------------------------------------------

RARE_TARGET_UPDATE EXPERIMENT COMPLETED
======================================================================


================================================================================
HYPERPARAMETER EXPLORATION RESULTS
================================================================================

PERFORMANCE COMPARISON:
--------------------------------------------------------------------------------
Config               Training     Eval Mean    Eval Std     95% CI              
--------------------------------------------------------------------------------
baseline             -84.9        -145.4       32.3         [-185.5, -105.3]    
high_lr              -187.7       -172.0       10.2         [-184.6, -159.3]    
low_lr               -124.5       -162.1       19.3         [-186.2, -138.1]    
large_batch          -121.7       -154.0       18.0         [-176.3, -131.7]    
small_batch          -107.7       -299.2       54.4         [-366.7, -231.6]    
high_gamma           -76.4        -133.3       29.9         [-170.5, -96.2]     
low_gamma            -112.7       -324.7       120.0        [-473.7, -175.7]    
frequent_target_update -118.3       -152.1       10.4         [-164.9, -139.2]    
rare_target_update   -112.3       -172.8       16.3         [-193.1, -152.5]    

BEST CONFIGURATION: HIGH_GAMMA
Evaluation Score: -133.35
95% CI: [-170.49, -96.21]

Best Hyperparameters:
  learning_rate: 0.0003
  batch_size: 64
  gamma: 0.995
  target_update_every: 5

================================================================================
In [53]:
def create_hyperparameter_performance_chart(results):
    """Create comprehensive performance comparison chart"""
    
    # Extract data
    configs = []
    eval_means = []
    eval_stds = []
    train_rewards = []
    colors = []
    
    # Define colors for different hyperparameter types
    color_map = {
        'baseline': '#2E86AB',      # Blue
        'high_lr': '#A23B72',       # Dark Red
        'low_lr': '#F18F01',        # Orange  
        'large_batch': '#C73E1D',   # Red
        'small_batch': '#8B0000',   # Dark Red (poor performance)
        'high_gamma': '#228B22',    # Green (best performance)
        'low_gamma': '#8B0000',     # Dark Red (poor performance)
        'frequent_target_update': '#DAA520',  # Gold
        'rare_target_update': '#CD853F'      # Peru
    }
    
    for config_name, result in results.items():
        if result['eval_results']:
            configs.append(config_name.replace('_', '\n').title())
            eval_means.append(result['eval_results']['overall_mean'])
            eval_stds.append(result['eval_results']['overall_std'])
            train_rewards.append(result['best_training_reward'])
            colors.append(color_map.get(config_name, '#666666'))
    
    # Create figure with subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
    
    # Plot 1: Evaluation Performance
    bars1 = ax1.bar(configs, eval_means, yerr=eval_stds, 
                    color=colors, alpha=0.8, capsize=5, edgecolor='black', linewidth=1)
    
    ax1.set_title('Hyperparameter Evaluation Performance\n(21 Actions, 600 Episodes, Plateau Restart)', 
                  fontsize=14, fontweight='bold')
    ax1.set_ylabel('Mean Reward ± Std Dev', fontsize=12)
    ax1.grid(True, alpha=0.3, axis='y')
    ax1.tick_params(axis='x', rotation=45)
    
    # Highlight best performer
    best_idx = np.argmax(eval_means)
    bars1[best_idx].set_edgecolor('gold')
    bars1[best_idx].set_linewidth(4)
    
    # Add value labels on bars
    for bar, mean, std in zip(bars1, eval_means, eval_stds):
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height + std + 5,
                f'{mean:.1f}±{std:.1f}', ha='center', va='bottom', 
                fontsize=10, fontweight='bold')
    
    # Plot 2: Training vs Evaluation Performance
    scatter = ax2.scatter(train_rewards, eval_means, c=colors, s=200, 
                         alpha=0.8, edgecolors='black', linewidth=2)
    
    # Add labels for each point
    for i, config in enumerate(configs):
        ax2.annotate(config, (train_rewards[i], eval_means[i]), 
                    xytext=(5, 5), textcoords='offset points', 
                    fontsize=9, fontweight='bold')
    
    ax2.set_xlabel('Best Training Reward (10-episode average)', fontsize=12)
    ax2.set_ylabel('Evaluation Performance', fontsize=12)
    ax2.set_title('Training vs Evaluation Performance\n(Overtraining Analysis)', 
                  fontsize=14, fontweight='bold')
    ax2.grid(True, alpha=0.3)
    
    # Add diagonal line for reference
    min_val = min(min(train_rewards), min(eval_means))
    max_val = max(max(train_rewards), max(eval_means))
    ax2.plot([min_val, max_val], [min_val, max_val], 'r--', alpha=0.5, 
             label='Perfect Training-Eval Match')
    ax2.legend()
    
    plt.tight_layout()
    plt.savefig("hyperparameter_performance_analysis.png", dpi=300, bbox_inches='tight')
    plt.show()
In [54]:
def create_hyperparameter_heatmap(results):
    """Create heatmap showing individual hyperparameter effects"""
    
    # Extract hyperparameter values and performance
    param_analysis = {
        'learning_rate': {},
        'batch_size': {},
        'gamma': {},
        'target_update_every': {}
    }
    
    for config_name, result in results.items():
        if result['eval_results']:
            performance = result['eval_results']['overall_mean']
            hyperparams = result['hyperparameters']
            
            for param, value in hyperparams.items():
                if value not in param_analysis[param]:
                    param_analysis[param][value] = []
                param_analysis[param][value].append(performance)
    
    # Create summary matrix
    param_names = []
    param_values = []
    param_performance = []
    
    for param, value_dict in param_analysis.items():
        for value, performances in value_dict.items():
            param_names.append(f"{param}\n{value}")
            param_performance.append(np.mean(performances))
    
    # Create horizontal bar chart
    fig, ax = plt.subplots(figsize=(12, 8))
    
    # Color code by performance
    colors = plt.cm.RdYlGn((np.array(param_performance) - min(param_performance)) / 
                          (max(param_performance) - min(param_performance)))
    
    bars = ax.barh(param_names, param_performance, color=colors, alpha=0.8, 
                   edgecolor='black', linewidth=1)
    
    ax.set_xlabel('Mean Evaluation Performance', fontsize=12)
    ax.set_title('Individual Hyperparameter Impact Analysis', fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='x')
    
    # Add value labels
    for bar, perf in zip(bars, param_performance):
        width = bar.get_width()
        ax.text(width + 2, bar.get_y() + bar.get_height()/2, 
                f'{perf:.1f}', ha='left', va='center', fontweight='bold')
    
    plt.tight_layout()
    plt.savefig("hyperparameter_impact_analysis.png", dpi=300, bbox_inches='tight')
    plt.show()
In [55]:
def create_significance_analysis(results):
    """Create confidence interval comparison chart"""
    
    configs = []
    means = []
    ci_lowers = []
    ci_uppers = []
    
    for config_name, result in results.items():
        if result['eval_results']:
            configs.append(config_name.replace('_', ' ').title())
            eval_res = result['eval_results']
            means.append(eval_res['overall_mean'])
            ci_lowers.append(eval_res['ci_lower'])
            ci_uppers.append(eval_res['ci_upper'])
    
    # Sort by performance
    sorted_indices = np.argsort(means)[::-1]  # Descending order
    configs = [configs[i] for i in sorted_indices]
    means = [means[i] for i in sorted_indices]
    ci_lowers = [ci_lowers[i] for i in sorted_indices]
    ci_uppers = [ci_uppers[i] for i in sorted_indices]
    
    # Create confidence interval plot
    fig, ax = plt.subplots(figsize=(12, 10))
    
    y_pos = np.arange(len(configs))
    
    # Plot confidence intervals
    for i, (mean, ci_low, ci_up) in enumerate(zip(means, ci_lowers, ci_uppers)):
        color = 'green' if i == 0 else 'red' if mean < -200 else 'blue'
        ax.errorbar(mean, y_pos[i], xerr=[[mean-ci_low], [ci_up-mean]], 
                   fmt='o', color=color, capsize=5, capthick=2, markersize=8)
        
        # Add confidence interval values
        ax.text(ci_up + 5, y_pos[i], f'[{ci_low:.1f}, {ci_up:.1f}]', 
               va='center', fontsize=10)
    
    ax.set_yticks(y_pos)
    ax.set_yticklabels(configs)
    ax.set_xlabel('Performance (with 95% Confidence Intervals)', fontsize=12)
    ax.set_title('Statistical Significance Analysis\nNon-overlapping CIs indicate significant differences', 
                 fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='x')
    
    # Add baseline reference line
    baseline_mean = next(mean for i, mean in enumerate(means) 
                        if 'baseline' in configs[i].lower())
    ax.axvline(x=baseline_mean, color='orange', linestyle='--', alpha=0.7, 
               label='Baseline Performance')
    ax.legend()
    
    plt.tight_layout()
    plt.savefig("hyperparameter_significance_analysis.png", dpi=300, bbox_inches='tight')
    plt.show()
In [58]:
def create_complete_hyperparameter_analysis(results):
    """Create comprehensive hyperparameter analysis with all visualizations"""
    
    # Create all three visualizations
    create_hyperparameter_performance_chart(results)
    create_hyperparameter_heatmap(results) 
    create_significance_analysis(results)
    
    # Create summary insights
    print("\n" + "="*60)
    print("HYPERPARAMETER INSIGHTS FROM VISUALIZATIONS")
    print("="*60)
    
    # Find best and worst performers
    eval_means = {name: result['eval_results']['overall_mean'] 
                  for name, result in results.items() if result['eval_results']}
    
    best_config = max(eval_means.keys(), key=lambda x: eval_means[x])
    worst_config = min(eval_means.keys(), key=lambda x: eval_means[x])
    
    print(f"BEST: {best_config.upper()} ({eval_means[best_config]:.1f})")
    print(f"WORST: {worst_config.upper()} ({eval_means[worst_config]:.1f})")
    print(f"IMPROVEMENT: {eval_means[best_config] - eval_means[worst_config]:.1f} points")
    
    return {
        'best_config': best_config,
        'worst_config': worst_config,
        'performance_range': eval_means[best_config] - eval_means[worst_config]
    }
In [59]:
create_complete_hyperparameter_analysis(results)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
============================================================
HYPERPARAMETER INSIGHTS FROM VISUALIZATIONS
============================================================
BEST: HIGH_GAMMA (-133.3)
WORST: LOW_GAMMA (-324.7)
IMPROVEMENT: 191.3 points
Out[59]:
{'best_config': 'high_gamma',
 'worst_config': 'low_gamma',
 'performance_range': 191.32073872482596}

Observations and analysis ¶

    1. Best vs Worst Configurations
  • Best: High Gamma (γ = 0.995)

    • Evaluation Mean: -133.3, 95% CI: (-170.5, -96.2)
    • This is the highest-performing configuration, outperforming the baseline (-145.4) by ~12 points.
    • The improvement over the worst case (Low Gamma) is ~191 points.
  • Worst: Low Gamma (γ = 0.95)

    • Evaluation Mean: -324.7, 95% CI: (-473.7, -175.7)
    • Extremely poor performance compared to all other configurations — likely due to the agent discounting future rewards too aggressively, which causes it to focus on short-term gains and miss long-term strategies.

This matches RL theory — higher γ values usually encourage more future-oriented strategies, which can be beneficial in environments with long-term dependencies. A very low γ often causes short-sighted policies.


  1. Learning Rate Trends
  • High LR (-172.0) and Low LR (-162.1) both underperform compared to the baseline.
  • The best LR appears in the High Gamma config (0.0003), suggesting that moderate LR is best here.
  • Expected?
    • Yes — too high of a learning rate causes instability, and too low makes learning too slow.
    • Initially chosen best config balances both.

  1. Batch Size Effects
  • Small Batch (32): Massive drop in performance (-299.2).
    • Likely due to noisy gradient estimates and unstable updates.
  • Large Batch (128): Performs well (-154.0), slightly better than baseline.
  • Medium Batch (64): Found in best config — suggesting it balances variance reduction with generalization.
  • Expected?
    • Yes — too small a batch increases variance, too large may overfit or slow updates. Middle ground works best.

  1. Target Update Frequency
  • Frequent Update (every 5) performs better (-152.1) than Rare Update (every 20) (-172.8).
  • Frequent updates help the target network stay relevant, reducing lag in Q-value estimates.
  • However, too frequent can increase correlation between target and policy networks, which can hurt — but in your case, the benefit seems to outweigh the risk.
  • Expected?
    • Yes, especially in shorter training runs — stale targets hurt performance.

Overtraining Analysis

  • The Training vs Evaluation plot shows overfitting in some configs — e.g., Low Gamma and Small Batch have relatively better training rewards but much worse evaluation rewards.

  • High Gamma and Baseline are closer to the "perfect match" line, showing better generalization.

Statistical Significance

  • Low Gamma and Small Batch have non-overlapping confidence intervals with baseline — meaning their underperformance is statistically significant.
  • High Gamma overlaps baseline slightly but still trends better.
  • Most other differences are smaller and could be due to noise.

Final hyperparameters used ¶

  1. N_Actions = 21
  2. Episodes = 600
  3. Epsilon Strayegy = 'Pleteau restart
  4. Memory Size = 100 000
  5. Min Memory = 2000
  6. Learning rate = 3e-4
  7. Batch Size = 64
  8. Gamma = 0.995
  9. Target update every = 5

Running optimised model¶

In [11]:
# Set seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

class DQNAgent:
    def __init__(self, input_shape, n_actions, gamma, replay_memory_size, min_replay_memory,
                 batch_size, target_update_every, learning_rate, epsilon_start, epsilon_min, epsilon_decay):
        
        self.input_shape = input_shape
        self.n_actions = n_actions
        self.gamma = gamma
        self.replay_memory_size = replay_memory_size
        self.min_replay_memory = min_replay_memory
        self.batch_size = batch_size
        self.target_update_every = target_update_every
        self.learning_rate = learning_rate
        self.epsilon = epsilon_start
        self.epsilon_start = epsilon_start
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        
        self.memory = deque(maxlen=replay_memory_size)
        self.target_update_counter = 0
        
        # Build networks
        self.main_network = self._build_network()
        self.target_network = self._build_network()
        self.update_target()
        
        # Optimizer
        self.optimizer = Adam(learning_rate=learning_rate)
    
    def _build_network(self):
        inputs = Input(shape=(self.input_shape,))
        x = Dense(64, activation='relu')(inputs)
        x = Dense(64, activation='relu')(x)
        outputs = Dense(self.n_actions, activation='linear')(x)
        return Model(inputs=inputs, outputs=outputs)
    
    def select_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(0, self.n_actions)
        q_values = self.main_network(state.reshape(1, -1))
        return np.argmax(q_values[0])
    
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
    
    def train_step(self):
        if len(self.memory) < self.min_replay_memory:
            return
        
        batch = random.sample(self.memory, self.batch_size)
        states = np.array([transition[0] for transition in batch])
        actions = np.array([transition[1] for transition in batch])
        rewards = np.array([transition[2] for transition in batch])
        next_states = np.array([transition[3] for transition in batch])
        dones = np.array([transition[4] for transition in batch])
        
        target_q_values = self.target_network(next_states)
        max_target_q_values = np.max(target_q_values, axis=1)
        targets = rewards + (self.gamma * max_target_q_values * (1 - dones))
        
        with tf.GradientTape() as tape:
            q_values = self.main_network(states, training=True)
            q_values_for_actions = tf.reduce_sum(q_values * tf.one_hot(actions, self.n_actions), axis=1)
            loss = tf.reduce_mean(tf.square(targets - q_values_for_actions))
        
        gradients = tape.gradient(loss, self.main_network.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.main_network.trainable_variables))
    
    def update_target(self):
        self.target_network.set_weights(self.main_network.get_weights())
    
    def save(self, filepath):
        self.main_network.save_weights(filepath)
    
    def load(self, filepath):
        self.main_network.load_weights(filepath)
        self.update_target()
    
    def summary(self):
        self.main_network.summary()
In [12]:
class AdvancedDQNAgent(DQNAgent):
    def __init__(self, input_shape, n_actions, gamma, replay_memory_size, min_replay_memory, 
                 batch_size, target_update_every, learning_rate, epsilon_start, epsilon_min, 
                 epsilon_decay, epsilon_strategy="linear"):
        
        super().__init__(input_shape, n_actions, gamma, replay_memory_size, min_replay_memory,
                        batch_size, target_update_every, learning_rate, epsilon_start, 
                        epsilon_min, epsilon_decay)
        
        self.epsilon_strategy = epsilon_strategy
        self.epsilon_start = epsilon_start
        self.performance_history = deque(maxlen=50)
        self.last_improvement_episode = 0
        self.plateau_threshold = 20
        
    def adaptive_epsilon_decay(self, episode, recent_performance):
        """Adaptive epsilon based on learning progress"""
        
        if self.epsilon_strategy == "linear":
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
            
        elif self.epsilon_strategy == "performance_based":
            self.performance_history.append(recent_performance)
            
            if len(self.performance_history) >= 20:
                recent_avg = np.mean(list(self.performance_history)[-10:])
                older_avg = np.mean(list(self.performance_history)[-20:-10])
                
                if recent_avg > older_avg + 5:
                    decay_rate = 0.998
                    self.last_improvement_episode = episode
                else:
                    decay_rate = 0.992
                    
                return max(self.epsilon_min, self.epsilon * decay_rate)
            else:
                return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
                
        elif self.epsilon_strategy == "plateau_restart":
            self.performance_history.append(recent_performance)
            
            if len(self.performance_history) >= 20:
                recent_avg = np.mean(list(self.performance_history)[-10:])
                older_avg = np.mean(list(self.performance_history)[-20:-10])
                
                if recent_avg > older_avg + 5:
                    self.last_improvement_episode = episode
                
                episodes_since_improvement = episode - self.last_improvement_episode
                if episodes_since_improvement >= self.plateau_threshold:
                    print(f"Epsilon restart at episode {episode}: {self.epsilon:.3f} → {self.epsilon_start * 0.3:.3f}")
                    self.epsilon = self.epsilon_start * 0.3
                    self.last_improvement_episode = episode
                    return self.epsilon
                    
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
            
        elif self.epsilon_strategy == "high_exploration":
            epsilon_min_high = 0.15
            return max(epsilon_min_high, self.epsilon * 0.9995)
            
        else:
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
    
    def decay_epsilon_advanced(self, episode, recent_performance):
        """Advanced epsilon decay with strategy-specific logic"""
        self.epsilon = self.adaptive_epsilon_decay(episode, recent_performance)

def action_index_to_torque(action_index, n_actions):
    """Convert action index to torque value"""
    return -2.0 + (action_index * 4.0) / (n_actions - 1)
In [13]:
def train_optimized_dqn():
    """Train using your final optimized hyperparameters"""
    
    ENV_NAME = 'Pendulum-v0'
    INPUT_SHAPE = 3
    
    # FINAL OPTIMIZED HYPERPARAMETERS
    N_ACTIONS = 21
    MAX_EPISODES = 600
    MAX_STEPS = 200
    REPLAY_MEMORY_SIZE = 100000
    MIN_REPLAY_MEMORY = 2000
    LEARNING_RATE = 3e-4
    BATCH_SIZE = 64
    GAMMA = 0.995
    TARGET_UPDATE_EVERY = 5
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"
    
    SAVE_WEIGHTS_PATH = "optimized_dqn_weights.h5"

    print("=" * 70)
    print("OPTIMIZED DQN TRAINING")
    print(f"Researcher: gohyujie | Timestamp: 2025-08-09 10:40:05")
    print(f"Actions: {N_ACTIONS} | Episodes: {MAX_EPISODES} | Gamma: {GAMMA}")
    print(f"Memory: {REPLAY_MEMORY_SIZE:,} | Batch: {BATCH_SIZE} | LR: {LEARNING_RATE}")
    print(f"Epsilon Strategy: {EPSILON_STRATEGY.upper()}")
    print("=" * 70)
    print()

    env = gym.make(ENV_NAME)
    
    # Create AdvancedDQNAgent with plateau restart
    agent = AdvancedDQNAgent(
        INPUT_SHAPE, N_ACTIONS, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    print("AdvancedDQNAgent Model Summary:")
    agent.summary()
    print()
    
    scores = []
    best_avg_reward = -np.inf
    epsilon_history = []
    training_steps = 0
    best_episode = 0
    
    start = time.time()

    for ep in range(1, MAX_EPISODES + 1):
        ep_start = time.time()
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]
        
        # Ensure proper state shape
        s = np.array(s, dtype=np.float32)
        if s.shape != (3,):
            s = s.flatten()[:3]
            
        total_reward = 0
        episode_training_steps = 0

        for t in range(MAX_STEPS):
            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, N_ACTIONS)
            
            s_next, r, done, info = env.step([torque])
            s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            
            # Ensure proper next state shape
            s_next = np.array(s_next, dtype=np.float32)
            if s_next.shape != (3,):
                s_next = s_next.flatten()[:3]
            
            agent.remember(s, a_idx, r, s_next, done)
            
            # Train only if we have enough experiences
            if len(agent.memory) >= MIN_REPLAY_MEMORY:
                agent.train_step()
                training_steps += 1
                episode_training_steps += 1
            
            s = s_next
            total_reward += r
            if done:
                break

        scores.append(total_reward)
        
        # Advanced epsilon decay using plateau restart
        recent_performance = np.mean(scores[-10:]) if len(scores) >= 10 else total_reward
        agent.decay_epsilon_advanced(ep, recent_performance)
        epsilon_history.append(agent.epsilon)
        
        if ep % TARGET_UPDATE_EVERY == 0:
            agent.update_target()

        # Save checkpoints
        if ep % 150 == 0:
            agent.save(f"optimized_dqn_{ep}_weights.h5")
        
        avg_reward = np.mean(scores[-10:])
        ep_time = time.time() - ep_start
        
        # Track best performance
        if avg_reward > best_avg_reward:
            best_avg_reward = avg_reward
            best_episode = ep
            agent.save(SAVE_WEIGHTS_PATH)
        
        # Progress reporting
        if ep <= 10 or ep % 50 == 0 or ep in [100, 200, 300, 400, 500, 600]:
            memory_pct = (len(agent.memory) / REPLAY_MEMORY_SIZE) * 100
            episodes_since_improvement = ep - agent.last_improvement_episode
            print(f"Episode {ep:3d} | Reward: {total_reward:7.2f} | Avg(10): {avg_reward:7.2f} | "
                  f"ε: {agent.epsilon:.3f} | Memory: {len(agent.memory):,} ({memory_pct:.1f}%) | "
                  f"Steps: {episode_training_steps} | Time: {ep_time:.2f}s | "
                  f"Since Improv: {episodes_since_improvement}")

    env.close()
    total_time = time.time() - start
    avg_time_per_episode = total_time / MAX_EPISODES

    print()
    print("TRAINING COMPLETED")
    print(f"Episodes trained: {MAX_EPISODES}")
    print(f"Best episode: {best_episode}")
    print(f"Best average reward: {best_avg_reward:.2f}")
    print(f"Final epsilon: {agent.epsilon:.4f}")
    print(f"Total training steps: {training_steps:,}")
    print(f"Training time: {total_time:.2f}s ({avg_time_per_episode:.2f}s/ep)")
    print()
    
    return {
        'config_name': 'optimized_dqn',
        'hyperparameters': {
            'n_actions': N_ACTIONS,
            'episodes': MAX_EPISODES,
            'learning_rate': LEARNING_RATE,
            'batch_size': BATCH_SIZE,
            'gamma': GAMMA,
            'target_update_every': TARGET_UPDATE_EVERY,
            'replay_memory_size': REPLAY_MEMORY_SIZE,
            'min_replay_memory': MIN_REPLAY_MEMORY,
            'epsilon_strategy': EPSILON_STRATEGY
        },
        'training_results': {
            'episodes_trained': MAX_EPISODES,
            'best_episode': best_episode,
            'best_training_reward': best_avg_reward,
            'training_time': total_time,
            'time_per_episode': avg_time_per_episode,
            'total_training_steps': training_steps,
            'scores_history': scores,
            'epsilon_history': epsilon_history
        },
        'weights_path': SAVE_WEIGHTS_PATH
    }
In [124]:
def evaluate_optimized_dqn(num_episodes=20, num_runs=5):
    """Robust evaluation with epsilon=0 (pure exploitation)"""
    
    INPUT_SHAPE = 3
    MAX_STEPS = 200
    
    # EXACT SAME CONFIGURATION AS TRAINING
    N_ACTIONS = 21
    REPLAY_MEMORY_SIZE = 100000
    MIN_REPLAY_MEMORY = 2000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    GAMMA = 0.995
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"
    
    SAVE_WEIGHTS_PATH = "optimized_dqn_weights.h5"
    
    print(f"\nEVALUATION PHASE")
    print(f"Using EXACT training config: Memory={REPLAY_MEMORY_SIZE:,}, Batch={BATCH_SIZE}, LR={LEARNING_RATE}")
    
    # Create agent with IDENTICAL configuration
    agent = AdvancedDQNAgent(
        INPUT_SHAPE, N_ACTIONS, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    try:
        agent.load(SAVE_WEIGHTS_PATH)
        agent.epsilon = 0.0  # Force pure exploitation
        print(f"Loaded weights from {SAVE_WEIGHTS_PATH}")
    except FileNotFoundError:
        print(f"ERROR: Weights file {SAVE_WEIGHTS_PATH} not found!")
        return None
    
    print(f"Running {num_runs} runs × {num_episodes} episodes (epsilon=0.0)")
    
    all_run_results = []
    
    for run in range(num_runs):
        print(f"--- Run {run+1}/{num_runs} ---")
        env = gym.make('Pendulum-v0')
        run_rewards = []
        
        for ep in range(num_episodes):
            state = env.reset()
            if isinstance(state, tuple):
                state = state[0]
            state = np.array(state, dtype=np.float32)
            if state.shape != (3,):
                state = state.flatten()[:3]
            
            total_reward = 0
            
            for t in range(MAX_STEPS):
                a_idx = agent.select_action(state)  # epsilon=0, pure exploitation
                torque = action_index_to_torque(a_idx, N_ACTIONS)
                
                next_state, reward, done, info = env.step([torque])
                
                if isinstance(next_state, tuple):
                    next_state = next_state[0]
                next_state = np.array(next_state, dtype=np.float32)
                if next_state.shape != (3,):
                    next_state = next_state.flatten()[:3]
                
                total_reward += reward
                state = next_state
                
                if done:
                    break
            
            run_rewards.append(total_reward)
        
        env.close()
        
        run_mean = np.mean(run_rewards)
        run_std = np.std(run_rewards)
        all_run_results.append({
            'mean': run_mean,
            'std': run_std,
            'rewards': run_rewards
        })
        
        print(f"Run {run+1}: {run_mean:.1f} ± {run_std:.1f}")
    
    # Calculate overall statistics
    all_means = [run['mean'] for run in all_run_results]
    overall_mean = np.mean(all_means)
    overall_std = np.std(all_means)
    
    # 95% Confidence interval
    confidence_level = 0.95
    dof = len(all_means) - 1
    if dof > 0:
        t_critical = stats.t.ppf((1 + confidence_level) / 2, dof)
        margin_of_error = t_critical * (overall_std / np.sqrt(len(all_means)))
        ci_lower = overall_mean - margin_of_error
        ci_upper = overall_mean + margin_of_error
    else:
        ci_lower = ci_upper = overall_mean
    
    print(f"\nEVALUATION SUMMARY:")
    print(f"Overall mean: {overall_mean:.2f}")
    print(f"Run-to-run std: {overall_std:.2f}")
    print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
    print("-" * 50)
    
    return {
        'overall_mean': overall_mean,
        'overall_std': overall_std,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'run_means': all_means,
        'num_runs': num_runs,
        'num_episodes': num_episodes,
        'evaluation_config': {
            'memory_size': REPLAY_MEMORY_SIZE,
            'min_memory': MIN_REPLAY_MEMORY,
            'batch_size': BATCH_SIZE,
            'learning_rate': LEARNING_RATE,
            'gamma': GAMMA,
            'target_update_every': TARGET_UPDATE_EVERY,
            'epsilon_strategy': EPSILON_STRATEGY
        }
    }
In [125]:
def run_optimized_experiment():
    """Complete training and evaluation pipeline"""
    
    print("OPTIMIZED DQN EXPERIMENT")
    print("Using 5-Phase Research Optimized Hyperparameters")
    print(f"Researcher: gohyujie | Timestamp: 2025-08-09 10:40:05")
    print("=" * 80)
    print()
    
    # Training phase
    print("PHASE 1: TRAINING")
    print("-" * 40)
    training_results = train_optimized_dqn()
    
    # Evaluation phase
    print("\nPHASE 2: EVALUATION")
    print("-" * 40)
    eval_results = evaluate_optimized_dqn(num_episodes=20, num_runs=5)
    
    # Combine results
    final_results = {
        'timestamp': '2025-08-09 10:40:05',
        'researcher': 'gohyujie',
        'experiment_type': 'optimized_dqn_final',
        'training': training_results,
        'evaluation': eval_results
    }
    
    # Convert numpy types for JSON serialization
    def convert_numpy_types(obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        elif isinstance(obj, dict):
            return {key: convert_numpy_types(value) for key, value in obj.items()}
        elif isinstance(obj, list):
            return [convert_numpy_types(item) for item in obj]
        return obj
    
    final_results = convert_numpy_types(final_results)
    
    # Save results
    with open('optimized_dqn_results.json', 'w') as f:
        json.dump(final_results, f, indent=2)
    
    print(f"\nEXPERIMENT COMPLETED")
    print(f"Results saved to: optimized_dqn_results.json")
    print(f"Weights saved to: optimized_dqn_weights.h5")
    
    if eval_results:
        print(f"\nFINAL PERFORMANCE:")
        print(f"Training Best Avg: {training_results['training_results']['best_training_reward']:.2f}")
        print(f"Evaluation Mean: {eval_results['overall_mean']:.2f}")
        print(f"95% CI: [{eval_results['ci_lower']:.2f}, {eval_results['ci_upper']:.2f}]")
    
    return final_results
In [126]:
# Execute the experiment
if __name__ == "__main__":
    results = run_optimized_experiment()
OPTIMIZED DQN EXPERIMENT
Using 5-Phase Research Optimized Hyperparameters
Researcher: gohyujie | Timestamp: 2025-08-09 10:40:05
================================================================================

PHASE 1: TRAINING
----------------------------------------
======================================================================
OPTIMIZED DQN TRAINING
Researcher: gohyujie | Timestamp: 2025-08-09 10:40:05
Actions: 21 | Episodes: 600 | Gamma: 0.995
Memory: 100,000 | Batch: 64 | LR: 0.0003
Epsilon Strategy: PLATEAU_RESTART
======================================================================

AdvancedDQNAgent Model Summary:
Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_7 (InputLayer)        [(None, 3)]               0         
                                                                 
 dense_144 (Dense)           (None, 64)                256       
                                                                 
 dense_145 (Dense)           (None, 64)                4160      
                                                                 
 dense_146 (Dense)           (None, 21)                1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode   1 | Reward: -1532.43 | Avg(10): -1532.43 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.04s | Since Improv: 1
Episode   2 | Reward: -1501.37 | Avg(10): -1516.90 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.05s | Since Improv: 2
Episode   3 | Reward: -1421.02 | Avg(10): -1484.94 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.02s | Since Improv: 3
Episode   4 | Reward: -1367.78 | Avg(10): -1455.65 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.04s | Since Improv: 4
Episode   5 | Reward: -944.03 | Avg(10): -1353.32 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.05s | Since Improv: 5
Episode   6 | Reward: -888.07 | Avg(10): -1275.78 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.04s | Since Improv: 6
Episode   7 | Reward: -1167.89 | Avg(10): -1260.37 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.06s | Since Improv: 7
Episode   8 | Reward: -1179.74 | Avg(10): -1250.29 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.04s | Since Improv: 8
Episode   9 | Reward: -1359.04 | Avg(10): -1262.37 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.09s | Since Improv: 9
Episode  10 | Reward: -1659.32 | Avg(10): -1302.07 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.18s | Since Improv: 10
Episode  50 | Reward: -643.93 | Avg(10): -1076.32 | ε: 0.778 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 5.35s | Since Improv: 0
Episode 100 | Reward: -1025.19 | Avg(10): -951.42 | ε: 0.606 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 4.35s | Since Improv: 0
Epsilon restart at episode 134: 0.513 → 0.300
Episode 150 | Reward: -352.97 | Avg(10): -506.22 | ε: 0.277 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 4.41s | Since Improv: 0
Episode 200 | Reward: -126.12 | Avg(10): -183.98 | ε: 0.215 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 4.56s | Since Improv: 0
Episode 250 | Reward: -249.87 | Avg(10): -241.70 | ε: 0.168 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 4.52s | Since Improv: 9
Episode 300 | Reward: -337.24 | Avg(10): -129.60 | ε: 0.131 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 4.56s | Since Improv: 8
Epsilon restart at episode 328: 0.114 → 0.300
Episode 350 | Reward: -122.94 | Avg(10): -305.15 | ε: 0.269 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 4.41s | Since Improv: 9
Episode 400 | Reward: -373.83 | Avg(10): -171.11 | ε: 0.209 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 6.93s | Since Improv: 0
Episode 450 | Reward: -123.75 | Avg(10): -193.32 | ε: 0.163 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 6.17s | Since Improv: 12
Epsilon restart at episode 492: 0.133 → 0.300
Episode 500 | Reward: -124.34 | Avg(10): -158.92 | ε: 0.288 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.43s | Since Improv: 0
Episode 550 | Reward: -127.06 | Avg(10): -145.41 | ε: 0.224 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.56s | Since Improv: 10
Epsilon restart at episode 560: 0.214 → 0.300
Episode 600 | Reward: -240.19 | Avg(10): -220.81 | ε: 0.245 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.75s | Since Improv: 15

TRAINING COMPLETED
Episodes trained: 600
Best episode: 463
Best average reward: -95.41
Final epsilon: 0.2455
Total training steps: 118,001
Training time: 3073.91s (5.12s/ep)


PHASE 2: EVALUATION
----------------------------------------

EVALUATION PHASE
Using EXACT training config: Memory=100,000, Batch=64, LR=0.0003
Loaded weights from optimized_dqn_weights.h5
Running 5 runs × 20 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -133.9 ± 102.7
--- Run 2/5 ---
Run 2: -155.4 ± 102.7
--- Run 3/5 ---
Run 3: -197.1 ± 90.0
--- Run 4/5 ---
Run 4: -148.3 ± 98.1
--- Run 5/5 ---
Run 5: -171.3 ± 90.0

EVALUATION SUMMARY:
Overall mean: -161.17
Run-to-run std: 21.62
95% CI: [-188.02, -134.32]
--------------------------------------------------

EXPERIMENT COMPLETED
Results saved to: optimized_dqn_results.json
Weights saved to: optimized_dqn_weights.h5

FINAL PERFORMANCE:
Training Best Avg: -95.41
Evaluation Mean: -161.17
95% CI: [-188.02, -134.32]

Observations

  • Previous Hyperparameter Tuning (HIGH_GAMMA):

    • Training Best: -76.4
    • Evaluation Mean: -133.3
    • 95% CI: (-170.5, -96.2)
  • Current Standalone Run:

    • Training Best: -95.41 (20 points worse)
    • Evaluation Mean: -161.17 (28 points worse)
    • 95% CI: (-188.02, -134.32) (much worse range)
In [127]:
# Use this exact config that worked before
high_gamma_config = {
    "name": "high_gamma",
    "learning_rate": 3e-4,
    "batch_size": 64,
    "gamma": 0.995,
    "target_update_every": 5,
    "description": "Higher gamma for long-term rewards"
}

results = train_hyperparameter_experiment(
    n_actions=21,
    hyperparam_config=high_gamma_config,
    experiment_prefix="final_optimized"
)
======================================================================
Hyperparameter Experiment: HIGH_GAMMA
LR: 0.0003 | Batch: 64 | Gamma: 0.995 | Target Update: 5
Using PLATEAU RESTART epsilon strategy with AdvancedDQNAgent
======================================================================

Model Summary:
Model: "model_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_11 (InputLayer)       [(None, 3)]               0         
                                                                 
 dense_156 (Dense)           (None, 64)                256       
                                                                 
 dense_157 (Dense)           (None, 64)                4160      
                                                                 
 dense_158 (Dense)           (None, 21)                1365      
                                                                 
=================================================================
Total params: 5781 (22.58 KB)
Trainable params: 5781 (22.58 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Episode 1 | Reward: -876.18 | Avg(10): -876.18 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.02s | Since Improv: 1
Episode 2 | Reward: -1173.89 | Avg(10): -1025.04 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.02s | Since Improv: 2
Episode 3 | Reward: -1633.56 | Avg(10): -1227.88 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.01s | Since Improv: 3
Episode 4 | Reward: -975.98 | Avg(10): -1164.90 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.03s | Since Improv: 4
Episode 5 | Reward: -842.49 | Avg(10): -1100.42 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.02s | Since Improv: 5
Episode 6 | Reward: -877.57 | Avg(10): -1063.28 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.02s | Since Improv: 6
Episode 7 | Reward: -768.38 | Avg(10): -1021.15 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.06s | Since Improv: 7
Episode 8 | Reward: -1134.96 | Avg(10): -1035.37 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.06s | Since Improv: 8
Episode 9 | Reward: -1290.34 | Avg(10): -1063.70 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.04s | Since Improv: 9
Episode 10 | Reward: -1786.12 | Avg(10): -1135.95 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.09s | Since Improv: 10
Epsilon restart at episode 20: 0.909 → 0.300
Epsilon restart at episode 40: 0.273 → 0.300
Episode 50 | Reward: -794.09 | Avg(10): -953.84 | ε: 0.285 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 4.75s | Since Improv: 0
Epsilon restart at episode 100: 0.223 → 0.300
Episode 100 | Reward: -1310.24 | Avg(10): -1095.37 | ε: 0.300 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 4.99s | Since Improv: 0
Epsilon restart at episode 120: 0.273 → 0.300
Episode 150 | Reward: -127.63 | Avg(10): -368.02 | ε: 0.258 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 5.12s | Since Improv: 0
Episode 200 | Reward: -246.89 | Avg(10): -242.58 | ε: 0.201 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 6.03s | Since Improv: 0
Episode 250 | Reward: -365.57 | Avg(10): -230.01 | ε: 0.156 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 6.39s | Since Improv: 1
Episode 300 | Reward: -129.06 | Avg(10): -159.71 | ε: 0.122 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 5.03s | Since Improv: 0
Epsilon restart at episode 326: 0.107 → 0.300
Episode 350 | Reward: -130.23 | Avg(10): -231.65 | ε: 0.266 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 5.40s | Since Improv: 0
Episode 400 | Reward: -475.32 | Avg(10): -208.19 | ε: 0.207 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 5.43s | Since Improv: 0
Episode 450 | Reward: -125.23 | Avg(10): -244.58 | ε: 0.161 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 5.71s | Since Improv: 1
Epsilon restart at episode 500: 0.126 → 0.300
Episode 500 | Reward: -125.81 | Avg(10): -169.78 | ε: 0.300 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.69s | Since Improv: 0
Epsilon restart at episode 520: 0.273 → 0.300
Episode 550 | Reward: -125.80 | Avg(10): -244.57 | ε: 0.258 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.45s | Since Improv: 0
Episode 600 | Reward: -2.49 | Avg(10): -198.01 | ε: 0.201 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 4.71s | Since Improv: 12

TRAINING COMPLETED
Episodes trained: 600
Best episode: 260
Best average reward: -124.47
Final epsilon: 0.2009
Total training steps: 118,001
Training time: 3252.81s (5.42s/ep)

Evaluating trained model...

 Evaluating: final_optimized
Using EXACT training config: Memory=100,000, Batch=64, LR=0.0003
 Loaded weights from final_optimized_weights.h5
Running 5 runs × 20 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -157.2 ± 76.1
--- Run 2/5 ---
Run 2: -158.2 ± 67.3
--- Run 3/5 ---
Run 3: -195.9 ± 96.8
--- Run 4/5 ---
Run 4: -115.8 ± 87.4
--- Run 5/5 ---
Run 5: -159.4 ± 94.7

 EVALUATION SUMMARY:
Overall mean: -157.32
Run-to-run std: 25.38
95% CI: [-188.82, -125.81]
--------------------------------------------------

Observations and analysis ¶

  • I just realised that the plateau restart is too aggressuve since it restarted 7 times in 6 episodes.

  • Overfitting or Instability: The discrepancy between the best training average reward (-124.47) and the overall evaluation mean reward (-157.32) suggests that the agent is either overfitting to its training experience or the policy is not completely stable. It learned to perform well in specific training scenarios but struggles to generalize that performance when exploration is turned off during evaluation. The aggressive epsilon restart strategy during training further supports the idea of instability.

  • Hyperparameter Choice (γ = 0.995): The high gamma value is designed for long-term rewards, which seems to have helped the agent improve significantly from its initial performance. However, the plateau restarts and performance drops (e.g., at episode 400) indicate that this hyperparameter alone wasn't enough to stabilize the learning process. The policy might still be struggling with a credit assignment problem in a dynamic environment, where a single action has delayed effects.

  • Evaluation Reliability: The wide confidence interval and high run-to-run standard deviation highlight that the model's performance isn't as robust as we'd hope. While the mean reward of -157.32 is a reasonable estimate, the high variance means you can't be fully certain that the model will perform consistently well in any given run. The agent can get lucky in some runs (e.g., run 4 with -115.8) but perform much worse in others (e.g., run 3 with -195.9).

Personal realisation of why it resulted like this ¶

Issue: (One-Factor-at-a-Time Tuning

  • This method assumes that each hyperparameter's optimal value is independent of the others, which is often not the case.

  • Hyperparameters are not isolated variables; they interact with each other. For example, a learning rate that works well with a small batch size might cause instability with a large batch size. Similarly, the optimal gamma might depend on the specific learning rate.

What I Should Have Done

  • To properly account for these interactions, you need to use a more sophisticated hyperparameter search strategy, such as:

  • Grid Search: Exhaustively test every possible combination of hyperparameters. This is computationally expensive but guarantees finding the best combination within the defined search space.

  • Random Search: Randomly sample combinations of hyperparameters from a defined distribution. This is often more efficient than a grid search and can find good results quickly.

  • Bayesian Optimization: Use a model to predict which hyperparameter combinations are most likely to yield good results, intelligently exploring the search space.


Next steps (Use DDQN) ¶

How can DDQN help?

  • The core issue I'm facing is likely rooted in the instability of agent's value estimates. Standard DQN has a known problem of overestimating Q-values, which can lead to unstable learning and an agent getting stuck on suboptimal policies that it falsely believes are very good. This matches what you're seeing:

  • The PLATEAU RESTART strategy triggering frequently: The agent's performance keeps stalling or dropping, indicating that its learning is not a smooth, continuous improvement but rather a series of gains and then setbacks. This is a classic symptom of unstable Q-value estimates.

  • The gap between training and evaluation performance: The best training reward is significantly better than the evaluation mean. This can happen when the agent overfits to noisy, overestimated Q-values during training.

DDQN directly addresses this overestimation problem. It uses a separate network to select the best action and another (the target network) to evaluate it. This decouples the selection and evaluation, which dramatically reduces the overestimation bias and leads to more stable and reliable learning.¶

Introduced Q value

  • What is the Average Q-value?

    • The average Q-value is the average of the estimated Q-values for the states and actions the agent has experienced during a training episode or a set of episodes. A Q-value, or quality value, is the agent's estimate of the total expected reward it will receive by taking a specific action in a specific state and then following its policy thereafter.

    • A high average Q-value suggests the agent is confident that its policy will lead to high rewards.

    • A low average Q-value suggests the agent is pessimistic about its policy.

    • Monitoring this metric gives me an additional window into the agent's internal learning process, beyond just the final reward.

By monitoring the average Q-value, I get a better sense of this overestimation.

  • Before Double DQN: The average Q-value would likely be high and potentially unstable, reflecting the over-optimism of the agent. This might mask the true performance of the agent.

  • With Double DQN: The very purpose of Double DQN is to reduce this overestimation. By monitoring the average Q-value, I can visually confirm that my algorithm is working as intended. I would expect the average Q-value to be lower and more stable with Double DQN compared to a standard DQN, as the agent's value estimates become more accurate and less biased.

In [22]:
# Set seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

class DQNAgent:
    def __init__(self, input_shape, n_actions, gamma, replay_memory_size, min_replay_memory,
                 batch_size, target_update_every, learning_rate, epsilon_start, epsilon_min, epsilon_decay):
        
        self.input_shape = input_shape
        self.n_actions = n_actions
        self.gamma = gamma
        self.replay_memory_size = replay_memory_size
        self.min_replay_memory = min_replay_memory
        self.batch_size = batch_size
        self.target_update_every = target_update_every
        self.learning_rate = learning_rate
        self.epsilon = epsilon_start
        self.epsilon_start = epsilon_start
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        
        self.memory = deque(maxlen=replay_memory_size)
        self.target_update_counter = 0
        
        # Build networks
        self.main_network = self._build_network()
        self.target_network = self._build_network()
        self.update_target()
        
        # Optimizer
        self.optimizer = Adam(learning_rate=learning_rate)
    
    def _build_network(self):
        inputs = Input(shape=(self.input_shape,))
        x = Dense(64, activation='relu')(inputs)
        x = Dense(64, activation='relu')(x)
        outputs = Dense(self.n_actions, activation='linear')(x)
        return Model(inputs=inputs, outputs=outputs)
    
    def select_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(0, self.n_actions)
        q_values = self.main_network(state.reshape(1, -1))
        return np.argmax(q_values[0])
    
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
    
    def train_step(self):
        if len(self.memory) < self.min_replay_memory:
            return
        
        batch = random.sample(self.memory, self.batch_size)
        states = np.array([transition[0] for transition in batch])
        actions = np.array([transition[1] for transition in batch])
        rewards = np.array([transition[2] for transition in batch])
        next_states = np.array([transition[3] for transition in batch])
        dones = np.array([transition[4] for transition in batch])
        
        target_q_values = self.target_network(next_states)
        max_target_q_values = np.max(target_q_values, axis=1)
        targets = rewards + (self.gamma * max_target_q_values * (1 - dones))
        
        with tf.GradientTape() as tape:
            q_values = self.main_network(states, training=True)
            q_values_for_actions = tf.reduce_sum(q_values * tf.one_hot(actions, self.n_actions), axis=1)
            loss = tf.reduce_mean(tf.square(targets - q_values_for_actions))
        
        gradients = tape.gradient(loss, self.main_network.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.main_network.trainable_variables))
    
    def update_target(self):
        self.target_network.set_weights(self.main_network.get_weights())
    
    def save(self, filepath):
        self.main_network.save_weights(filepath)
    
    def load(self, filepath):
        self.main_network.load_weights(filepath)
        self.update_target()
    
    def summary(self):
        self.main_network.summary()

class AdvancedDQNAgent(DQNAgent):
    def __init__(self, input_shape, n_actions, gamma, replay_memory_size, min_replay_memory, 
                 batch_size, target_update_every, learning_rate, epsilon_start, epsilon_min, 
                 epsilon_decay, epsilon_strategy="linear"):
        
        super().__init__(input_shape, n_actions, gamma, replay_memory_size, min_replay_memory,
                        batch_size, target_update_every, learning_rate, epsilon_start, 
                        epsilon_min, epsilon_decay)
        
        self.epsilon_strategy = epsilon_strategy
        self.epsilon_start = epsilon_start
        self.performance_history = deque(maxlen=50)
        self.last_improvement_episode = 0
        self.plateau_threshold = 20
        
    def adaptive_epsilon_decay(self, episode, recent_performance):
        """Adaptive epsilon based on learning progress"""
        
        if self.epsilon_strategy == "linear":
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
            
        elif self.epsilon_strategy == "performance_based":
            self.performance_history.append(recent_performance)
            
            if len(self.performance_history) >= 20:
                recent_avg = np.mean(list(self.performance_history)[-10:])
                older_avg = np.mean(list(self.performance_history)[-20:-10])
                
                if recent_avg > older_avg + 5:
                    decay_rate = 0.998
                    self.last_improvement_episode = episode
                else:
                    decay_rate = 0.992
                    
                return max(self.epsilon_min, self.epsilon * decay_rate)
            else:
                return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
                
        elif self.epsilon_strategy == "plateau_restart":
            self.performance_history.append(recent_performance)
            
            if len(self.performance_history) >= 20:
                recent_avg = np.mean(list(self.performance_history)[-10:])
                older_avg = np.mean(list(self.performance_history)[-20:-10])
                
                if recent_avg > older_avg + 5:
                    self.last_improvement_episode = episode
                
                episodes_since_improvement = episode - self.last_improvement_episode
                if episodes_since_improvement >= self.plateau_threshold:
                    print(f"Epsilon restart at episode {episode}: {self.epsilon:.3f} → {self.epsilon_start * 0.3:.3f}")
                    self.epsilon = self.epsilon_start * 0.3
                    self.last_improvement_episode = episode
                    return self.epsilon
                    
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
            
        elif self.epsilon_strategy == "high_exploration":
            epsilon_min_high = 0.15
            return max(epsilon_min_high, self.epsilon * 0.9995)
            
        else:
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
    
    def decay_epsilon_advanced(self, episode, recent_performance):
        """Advanced epsilon decay with strategy-specific logic"""
        self.epsilon = self.adaptive_epsilon_decay(episode, recent_performance)

class DoubleDQNAgent(AdvancedDQNAgent):
    def __init__(self, input_shape, n_actions, gamma, replay_memory_size, min_replay_memory, 
                 batch_size, target_update_every, learning_rate, epsilon_start, epsilon_min, 
                 epsilon_decay, epsilon_strategy="plateau_restart"):
        
        super().__init__(input_shape, n_actions, gamma, replay_memory_size, min_replay_memory,
                        batch_size, target_update_every, learning_rate, epsilon_start, 
                        epsilon_min, epsilon_decay, epsilon_strategy)
        
        # Track for analysis
        self.q_values_history = deque(maxlen=1000)
        
    def train_step(self):
        if len(self.memory) < self.min_replay_memory:
            return
        
        batch = random.sample(self.memory, self.batch_size)
        states = np.array([transition[0] for transition in batch])
        actions = np.array([transition[1] for transition in batch])
        rewards = np.array([transition[2] for transition in batch])
        next_states = np.array([transition[3] for transition in batch])
        dones = np.array([transition[4] for transition in batch])
        
        # DOUBLE DQN: Use main network to select, target network to evaluate
        next_q_values_main = self.main_network(next_states)
        best_actions = tf.argmax(next_q_values_main, axis=1)  # Keep as tensor
        next_q_values_target = self.target_network(next_states)
        
        # Use tf.gather instead of numpy indexing
        batch_indices = tf.range(self.batch_size)
        indices = tf.stack([batch_indices, tf.cast(best_actions, tf.int32)], axis=1)
        max_target_q_values = tf.gather_nd(next_q_values_target, indices)
        
        targets = rewards + (self.gamma * max_target_q_values * (1 - dones))
        
        with tf.GradientTape() as tape:
            q_values = self.main_network(states, training=True)
            q_values_for_actions = tf.reduce_sum(q_values * tf.one_hot(actions, self.n_actions), axis=1)
            loss = tf.reduce_mean(tf.square(targets - q_values_for_actions))
        
        gradients = tape.gradient(loss, self.main_network.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.main_network.trainable_variables))
        
        # Track Q-values for analysis
        self.q_values_history.append(float(tf.reduce_mean(q_values)))

def action_index_to_torque(action_index, n_actions):
    """Convert action index to torque value"""
    return -2.0 + (action_index * 4.0) / (n_actions - 1)
In [23]:
def evaluate_stability(weights_path, num_episodes=20, num_runs=3):
    """Evaluate model stability across multiple runs"""
    
    INPUT_SHAPE = 3
    N_ACTIONS = 21
    MAX_STEPS = 200
    
    # Use your optimized hyperparameters for agent creation
    REPLAY_MEMORY_SIZE = 100000
    MIN_REPLAY_MEMORY = 2000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    GAMMA = 0.995
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"
    
    print(f"Evaluating: {weights_path}")
    
    # Create agent with same config as training
    agent = AdvancedDQNAgent(
        INPUT_SHAPE, N_ACTIONS, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    try:
        agent.load(weights_path)
        agent.epsilon = 0.0  # Pure exploitation for evaluation
        print(f"Loaded weights from {weights_path}")
    except FileNotFoundError:
        print(f"ERROR: Weights file {weights_path} not found!")
        return None
    except Exception as e:
        print(f"ERROR loading weights: {e}")
        return None
    
    print(f"Running {num_runs} runs × {num_episodes} episodes (epsilon=0.0)")
    
    all_run_results = []
    all_rewards = []
    
    for run in range(num_runs):
        print(f"--- Run {run+1}/{num_runs} ---")
        env = gym.make('Pendulum-v0')
        run_rewards = []
        
        for ep in range(num_episodes):
            state = env.reset()
            if isinstance(state, tuple):
                state = state[0]
            state = np.array(state, dtype=np.float32)
            if state.shape != (3,):
                state = state.flatten()[:3]
            
            total_reward = 0
            
            for t in range(MAX_STEPS):
                a_idx = agent.select_action(state)
                torque = action_index_to_torque(a_idx, N_ACTIONS)
                
                next_state, reward, done, info = env.step([torque])
                
                if isinstance(next_state, tuple):
                    next_state = next_state[0]
                next_state = np.array(next_state, dtype=np.float32)
                if next_state.shape != (3,):
                    next_state = next_state.flatten()[:3]
                
                total_reward += reward
                state = next_state
                
                if done:
                    break
            
            run_rewards.append(total_reward)
        
        env.close()
        
        run_mean = np.mean(run_rewards)
        run_std = np.std(run_rewards)
        all_run_results.append({
            'mean': run_mean,
            'std': run_std,
            'rewards': run_rewards
        })
        all_rewards.extend(run_rewards)
        
        print(f"Run {run+1}: {run_mean:.1f} ± {run_std:.1f}")
    
    # Calculate overall statistics
    all_means = [run['mean'] for run in all_run_results]
    overall_mean = np.mean(all_rewards)
    overall_std = np.std(all_rewards)
    run_consistency = np.std(all_means)
    
    print(f"\nEVALUATION SUMMARY:")
    print(f"Overall mean: {overall_mean:.2f}")
    print(f"Overall std: {overall_std:.2f}")
    print(f"Run-to-run consistency: {run_consistency:.2f} (lower = more consistent)")
    print("-" * 50)
    
    return {
        'mean': overall_mean,
        'std': overall_std,
        'run_consistency': run_consistency,
        'all_rewards': all_rewards,
        'num_runs': num_runs,
        'num_episodes': num_episodes
    }
In [171]:
def train_phase1_double_dqn():
    """Phase 1: Test Double DQN impact"""
    
    # YOUR EXACT HYPERPARAMETERS
    ENV_NAME = 'Pendulum-v0'
    INPUT_SHAPE = 3
    N_ACTIONS = 21
    MAX_EPISODES = 600
    MAX_STEPS = 200
    REPLAY_MEMORY_SIZE = 100000
    MIN_REPLAY_MEMORY = 2000
    LEARNING_RATE = 3e-4
    BATCH_SIZE = 64
    GAMMA = 0.995
    TARGET_UPDATE_EVERY = 5
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"
    
    print("=" * 70)
    print("PHASE 1: DOUBLE DQN STABILITY TEST")
    print("ONLY CHANGE: Regular DQN → Double DQN")
    print("Hypothesis: Reduce overestimation bias that causes instability")
    print("=" * 70)

    env = gym.make(ENV_NAME)
    
    agent = DoubleDQNAgent(
        INPUT_SHAPE, N_ACTIONS, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    print("DoubleDQNAgent Model Summary:")
    agent.summary()
    print()
    
    scores = []
    best_avg_reward = -np.inf
    epsilon_history = []
    training_steps = 0
    best_episode = 0
    
    start = time.time()

    for ep in range(1, MAX_EPISODES + 1):
        ep_start = time.time()
        s = env.reset()
        s = s if isinstance(s, np.ndarray) else s[0]
        s = np.array(s, dtype=np.float32)
        if s.shape != (3,):
            s = s.flatten()[:3]
            
        total_reward = 0
        episode_training_steps = 0

        for t in range(MAX_STEPS):
            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, N_ACTIONS)
            
            s_next, r, done, info = env.step([torque])
            s_next = s_next if isinstance(s_next, np.ndarray) else s_next[0]
            s_next = np.array(s_next, dtype=np.float32)
            if s_next.shape != (3,):
                s_next = s_next.flatten()[:3]
            
            agent.remember(s, a_idx, r, s_next, done)
            
            if len(agent.memory) >= MIN_REPLAY_MEMORY:
                agent.train_step()
                training_steps += 1
                episode_training_steps += 1
            
            s = s_next
            total_reward += r
            if done:
                break

        scores.append(total_reward)
        
        # Your plateau restart epsilon strategy
        recent_performance = np.mean(scores[-10:]) if len(scores) >= 10 else total_reward
        agent.decay_epsilon_advanced(ep, recent_performance)
        epsilon_history.append(agent.epsilon)
        
        if ep % TARGET_UPDATE_EVERY == 0:
            agent.update_target()

        # Save checkpoints
        if ep % 150 == 0:
            agent.save(f"phase1_double_dqn_{ep}_weights.h5")
        
        avg_reward = np.mean(scores[-10:])
        ep_time = time.time() - ep_start
        
        if avg_reward > best_avg_reward:
            best_avg_reward = avg_reward
            best_episode = ep
            agent.save("phase1_double_dqn_weights.h5")
        
        # Progress reporting
        if ep <= 10 or ep % 50 == 0 or ep in [100, 200, 300, 400, 500, 600]:
            memory_pct = (len(agent.memory) / REPLAY_MEMORY_SIZE) * 100
            avg_q = np.mean(list(agent.q_values_history)[-100:]) if len(agent.q_values_history) > 0 else 0
            episodes_since_improvement = ep - agent.last_improvement_episode
            
            print(f"Episode {ep:3d} | Reward: {total_reward:7.2f} | Avg(10): {avg_reward:7.2f} | "
                  f"ε: {agent.epsilon:.3f} | Memory: {len(agent.memory):,} ({memory_pct:.1f}%) | "
                  f"Steps: {episode_training_steps} | Time: {ep_time:.2f}s | "
                  f"Avg Q-val: {avg_q:.2f} | Since Improv: {episodes_since_improvement}")

    env.close()
    total_time = time.time() - start
    avg_time_per_episode = total_time / MAX_EPISODES
    
    print()
    print("PHASE 1 TRAINING COMPLETED")
    print(f"Episodes trained: {MAX_EPISODES}")
    print(f"Best episode: {best_episode}")
    print(f"Best average reward: {best_avg_reward:.2f}")
    print(f"Final epsilon: {agent.epsilon:.4f}")
    print(f"Total training steps: {training_steps:,}")
    print(f"Training time: {total_time:.2f}s ({avg_time_per_episode:.2f}s/ep)")
    print()
    
    return {
        'agent': agent,
        'scores': scores,
        'epsilon_history': epsilon_history,
        'best_training_reward': best_avg_reward,
        'best_episode': best_episode,
        'training_time': total_time,
        'total_training_steps': training_steps
    }
In [172]:
def evaluate_phase1():
    """Evaluate Phase 1 results against baseline"""
    
    print("\n" + "="*70)
    print("PHASE 1 EVALUATION")
    print("Comparing Original vs Double DQN")
    print("="*70)
    
    # Evaluate original baseline
    print("BASELINE - Original Optimized Model:")
    original_results = evaluate_stability("optimized_dqn_weights.h5", num_episodes=20, num_runs=3)
    
    # Evaluate double DQN
    print("\nPHASE 1 - Double DQN Model:")
    double_results = evaluate_stability("phase1_double_dqn_weights.h5", num_episodes=20, num_runs=3)
    
    # Analysis
    print("\n" + "="*70)
    print("PHASE 1 ANALYSIS")
    print("="*70)
    
    if original_results and double_results:
        performance_improvement = double_results['mean'] - original_results['mean']
        stability_improvement = original_results['run_consistency'] - double_results['run_consistency']
        
        print(f"PERFORMANCE COMPARISON:")
        print(f"  Baseline:   {original_results['mean']:7.2f} ± {original_results['std']:5.2f}")
        print(f"  Double DQN: {double_results['mean']:7.2f} ± {double_results['std']:5.2f}")
        print(f"  Improvement: {performance_improvement:+7.2f}")
        print()
        print(f"STABILITY COMPARISON (lower = more stable):")
        print(f"  Baseline consistency:   {original_results['run_consistency']:5.2f}")
        print(f"  Double DQN consistency: {double_results['run_consistency']:5.2f}")
        print(f"  Stability improvement:  {stability_improvement:+5.2f}")
        print()
        
        if performance_improvement > 0:
            print("✓ DOUBLE DQN IMPROVED PERFORMANCE")
        else:
            print("✗ Double DQN did not improve performance")
            
        if stability_improvement > 0:
            print("✓ DOUBLE DQN IMPROVED STABILITY")
        else:
            print("✗ Double DQN did not improve stability")
            
        print()
        print("CONCLUSION:")
        if performance_improvement > 5 and stability_improvement > 0:
            print("Double DQN shows significant improvement. Proceed to Phase 2.")
        elif performance_improvement > 0:
            print("Double DQN shows moderate improvement. Continue testing.")
        else:
            print("Double DQN shows minimal improvement. May need different approach.")
    
    return {
        'baseline': original_results,
        'double_dqn': double_results,
        'performance_improvement': performance_improvement if (original_results and double_results) else None,
        'stability_improvement': stability_improvement if (original_results and double_results) else None
    }

def run_complete_phase1():
    """Run complete Phase 1 experiment"""
    
    print("SYSTEMATIC STABILITY IMPROVEMENT - PHASE 1")
    print("Testing individual impact of Double DQN")
    print("=" * 80)
    print()
    
    # Train Phase 1
    print("TRAINING PHASE 1...")
    training_results = train_phase1_double_dqn()
    
    # Evaluate Phase 1
    print("\nEVALUATING PHASE 1...")
    evaluation_results = evaluate_phase1()
    
    # Final summary
    print("\n" + "="*80)
    print("PHASE 1 COMPLETE")
    print("="*80)
    print("Files saved:")
    print("  - phase1_double_dqn_weights.h5 (best model)")
    print("  - phase1_double_dqn_150_weights.h5 (checkpoint)")
    print("  - phase1_double_dqn_300_weights.h5 (checkpoint)")
    print("  - phase1_double_dqn_450_weights.h5 (checkpoint)")
    print("  - phase1_double_dqn_600_weights.h5 (checkpoint)")
    print()
    print("Next: If results are promising, proceed to Phase 2 (Gradient Clipping)")
    
    return {
        'training': training_results,
        'evaluation': evaluation_results
    }

if name == "main": # Run complete Phase 1 experiment phase1_results = run_complete_phase1()

"SYSTEMATIC STABILITY IMPROVEMENT - PHASE 1

Testing individual impact of Double DQN

================================================================================

TRAINING PHASE 1...

======================================================================

PHASE 1: DOUBLE DQN STABILITY TEST

ONLY CHANGE: Regular DQN → Double DQN

Hypothesis: Reduce overestimation bias that causes instability

======================================================================

DoubleDQNAgent Model Summary:

Model: "model_24"


Layer (type) Output Shape Param #

=================================================================

input_25 (InputLayer) [(None, 3)] 0

dense_198 (Dense) (None, 64) 256

dense_199 (Dense) (None, 64) 4160

dense_200 (Dense) (None, 21) 1365

=================================================================

Total params: 5781 (22.58 KB)

Trainable params: 5781 (22.58 KB)

Non-trainable params: 0 (0.00 Byte)


Episode 1 | Reward: -1150.04 | Avg(10): -1150.04 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.02s | Avg Q-val: 0.00 | Since Improv: 1

Episode 2 | Reward: -1487.42 | Avg(10): -1318.73 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.03s | Avg Q-val: 0.00 | Since Improv: 2

Episode 3 | Reward: -882.75 | Avg(10): -1173.40 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.01s | Avg Q-val: 0.00 | Since Improv: 3

Episode 4 | Reward: -1504.23 | Avg(10): -1256.11 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.01s | Avg Q-val: 0.00 | Since Improv: 4

Episode 5 | Reward: -1701.85 | Avg(10): -1345.26 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.03s | Avg Q-val: 0.00 | Since Improv: 5

Episode 6 | Reward: -1067.46 | Avg(10): -1298.96 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.04s | Avg Q-val: 0.00 | Since Improv: 6

Episode 7 | Reward: -1781.46 | Avg(10): -1367.89 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.03s | Avg Q-val: 0.00 | Since Improv: 7

Episode 8 | Reward: -886.36 | Avg(10): -1307.70 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.03s | Avg Q-val: 0.00 | Since Improv: 8

Episode 9 | Reward: -1533.23 | Avg(10): -1332.76 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.04s | Avg Q-val: 0.00 | Since Improv: 9

Episode 10 | Reward: -1389.61 | Avg(10): -1338.44 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.14s | Avg Q-val: -0.08 | Since Improv: 10

Epsilon restart at episode 20: 0.909 → 0.300

Episode 50 | Reward: -1036.67 | Avg(10): -1221.20 | ε: 0.258 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 5.03s | Avg Q-val: -50.99 | Since Improv: 0

Episode 100 | Reward: -638.69 | Avg(10): -544.31 | ε: 0.201 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 5.16s | Avg Q-val: -81.73 | Since Improv: 0

Episode 150 | Reward: -370.54 | Avg(10): -492.28 | ε: 0.156 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 5.24s | Avg Q-val: -96.12 | Since Improv: 0

Episode 200 | Reward: -2.20 | Avg(10): -180.80 | ε: 0.122 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 5.10s | Avg Q-val: -97.23 | Since Improv: 0

Episode 250 | Reward: -241.64 | Avg(10): -224.20 | ε: 0.095 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 5.25s | Avg Q-val: -87.73 | Since Improv: 3

Epsilon restart at episode 267: 0.087 → 0.300

Episode 300 | Reward: -369.53 | Avg(10): -312.20 | ε: 0.254 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 5.06s | Avg Q-val: -80.90 | Since Improv: 11

Episode 350 | Reward: -1.55 | Avg(10): -193.12 | ε: 0.198 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 5.35s | Avg Q-val: -72.85 | Since Improv: 4

Episode 400 | Reward: -1.28 | Avg(10): -146.06 | ε: 0.154 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 5.26s | Avg Q-val: -65.51 | Since Improv: 0

Episode 450 | Reward: -243.56 | Avg(10): -206.56 | ε: 0.120 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 5.29s | Avg Q-val: -53.31 | Since Improv: 5

Episode 500 | Reward: -240.91 | Avg(10): -230.49 | ε: 0.093 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 5.65s | Avg Q-val: -44.81 | Since Improv: 11

Epsilon restart at episode 509: 0.090 → 0.300

Episode 550 | Reward: -123.91 | Avg(10): -227.96 | ε: 0.244 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 7.39s | Avg Q-val: -23.32 | Since Improv: 0

Episode 600 | Reward: -230.57 | Avg(10): -315.96 | ε: 0.190 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 6.93s | Avg Q-val: -15.43 | Since Improv: 8

PHASE 1 TRAINING COMPLETED

Episodes trained: 600

Best episode: 433

Best average reward: -98.60

Final epsilon: 0.1901

Total training steps: 118,001

Training time: 3318.39s (5.53s/ep)

EVALUATING PHASE 1...

======================================================================

PHASE 1 EVALUATION

Comparing Original vs Double DQN

======================================================================

BASELINE - Original Optimized Model:

Evaluating: optimized_dqn_weights.h5

Loaded weights from optimized_dqn_weights.h5

Running 3 runs × 20 episodes (epsilon=0.0)

--- Run 1/3 ---

Run 1: -108.6 ± 74.0

--- Run 2/3 ---

Run 2: -169.2 ± 107.1

--- Run 3/3 ---

Run 3: -150.4 ± 111.3

EVALUATION SUMMARY:

Overall mean: -142.74

Overall std: 102.10

Run-to-run consistency: 25.33 (lower = more consistent)


PHASE 1 - Double DQN Model:

Evaluating: phase1_double_dqn_weights.h5

Loaded weights from phase1_double_dqn_weights.h5

Running 3 runs × 20 episodes (epsilon=0.0)

--- Run 1/3 ---

Run 1: -183.8 ± 111.9

--- Run 2/3 ---

Run 2: -156.7 ± 97.3

--- Run 3/3 ---

Run 3: -158.8 ± 94.2

EVALUATION SUMMARY:

Overall mean: -166.42

Overall std: 102.18

Run-to-run consistency: 12.32 (lower = more consistent)


======================================================================

PHASE 1 ANALYSIS

======================================================================

PERFORMANCE COMPARISON:

Baseline: -142.74 ± 102.10

Double DQN: -166.42 ± 102.18

Improvement: -23.68

STABILITY COMPARISON (lower = more stable):

Baseline consistency: 25.33

Double DQN consistency: 12.32

Stability improvement: +13.01

✗ Double DQN did not improve performance

✓ DOUBLE DQN IMPROVED STABILITY

CONCLUSION:

Double DQN shows minimal improvement. May need different approach."

Metric Baseline (Optimized DQN) Double DQN Observation
Mean reward (eval) -142.74 -166.42 Double DQN performed worse on average reward.
Std deviation (eval) 102.10 102.18 Variability roughly the same.
Run-to-run consistency 25.33 12.32 Double DQN is much more stable between runs!
Training best avg reward N/A -98.60 (best episode 433) Training can achieve better rewards than eval.

Observations

  1. Performance:
  • Double DQN did not improve average performance compared to baseline, despite theoretical expectations. This can happen if baseline is already well-tuned or training duration was insufficient for Double DQN to show its advantage.
  1. Stability:
  • Double DQN improved stability substantially — lower run-to-run consistency score means less variance across evaluation runs, which is valuable in RL where training is often noisy.
  1. Training vs Evaluation Gap:
  • My best training average reward (-98.6) is notably better than evaluation means (~-142 to -166). This gap could hint at overfitting or a discrepancy in evaluation procedure (e.g., training vs pure exploitation).
  1. Training duration:
  • I trained for 600 episodes with checkpoints every 150. The best episode was at 433, which indicates that perhaps the model started overfitting or plateaued afterward. Maybe longer or different training schedules could help.

Analysis

  • Failed Performance Hypothesis, Confirmed Stability Hypothesis: The primary conclusion is that Double DQN successfully improved the agent's stability but did not improve its performance. This is a critical finding. The DDQN's consistency is twice as good as the baseline, indicating that it's more reliable and less prone to random poor performance. However, this stability did not translate to a higher overall mean reward. The agent is consistently bad, rather than randomly good or bad.

  • The Overestimation Problem Might Not Be The Only Issue: The original hypothesis was that instability was caused by Q-value overestimation. Double DQN, by design, mitigates this. Its success in improving stability confirms that overestimation was likely a contributing factor. However, the fact that the mean reward got worse suggests that simply reducing overestimation wasn't enough. The agent might still be failing to learn a good policy for other reasons, possibly related to credit assignment, exploration, or the complexity of the environment.

  • Potential for Further Optimization: The results point to a new direction. The Double DQN agent's improved stability provides a much more solid foundation for further tuning. I now have a more reliable agent, and any improvements from here are more likely to be due to actual policy learning rather than random chance.

I will test to see if training for longer would allow for better performance.¶

In [31]:
def train_double_dqn_with_metrics(episodes=1000, save_prefix="double_dqn_run"):
    """Train Double DQN with full metric logging and reproducibility."""
    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    tf.random.set_seed(SEED)
    
    ENV_NAME = 'Pendulum-v0'
    INPUT_SHAPE = 3
    N_ACTIONS = 21
    MAX_STEPS = 200
    REPLAY_MEMORY_SIZE = 100000
    MIN_REPLAY_MEMORY = 2000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    GAMMA = 0.995
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"
    
    print(f"\nTraining Double DQN ({episodes} episodes) with consistent conditions")
    print(f"Random seed: {SEED}")
    
    env = gym.make(ENV_NAME)
    # For reproducibility across Gym versions
    try:
        env.reset(seed=SEED)
    except TypeError:
        pass  # For older Gym versions
    
    agent = DoubleDQNAgent(
        INPUT_SHAPE, N_ACTIONS, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    metrics = {
        'episode': [],
        'reward': [],
        'avg10': [],
        'avg50': [],
        'epsilon': [],
        'memory': [],
        'q_values': [],
        'episode_steps': [],
        'episode_time': [],
        'since_improv': [],
    }
    best_avg_reward = -np.inf
    best_episode = 0
    total_training_steps = 0
    start_time = time.time()
    
    for ep in range(1, episodes + 1):
        ep_start = time.time()
        # Version-agnostic reset
        try:
            s = env.reset(seed=SEED + ep)
            if isinstance(s, tuple):  # Newer gym/gymnasium
                s = s[0]
        except TypeError:
            env.seed(SEED + ep)
            s = env.reset()
        s = np.asarray(s, dtype=np.float32).flatten()[:3]
        total_reward = 0
        episode_training_steps = 0

        for t in range(MAX_STEPS):
            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, N_ACTIONS)
            s_next, r, done, *info = env.step([torque])
            s_next = s_next[0] if isinstance(s_next, tuple) else s_next
            s_next = np.asarray(s_next, dtype=np.float32).flatten()[:3]
            
            agent.remember(s, a_idx, r, s_next, done)
            
            if len(agent.memory) >= MIN_REPLAY_MEMORY:
                agent.train_step()
                total_training_steps += 1
                episode_training_steps += 1
            
            s = s_next
            total_reward += r
            if done:
                break

        # Epsilon update and target network update
        agent.performance_history.append(total_reward)
        recent_performance = np.mean(list(agent.performance_history)[-10:]) if len(agent.performance_history) >= 10 else total_reward
        agent.decay_epsilon_advanced(ep, recent_performance)
        if ep % TARGET_UPDATE_EVERY == 0:
            agent.update_target()
        
        # Rolling averages
        avg10 = np.mean(metrics['reward'][-9:] + [total_reward]) if len(metrics['reward']) >= 9 else np.mean([total_reward])
        avg50 = np.mean(metrics['reward'][-49:] + [total_reward]) if len(metrics['reward']) >= 49 else np.mean([total_reward])
        avg_q = np.mean(list(agent.q_values_history)[-100:]) if agent.q_values_history else 0
        since_improv = ep - best_episode
        
        # Save metrics
        metrics['episode'].append(ep)
        metrics['reward'].append(total_reward)
        metrics['avg10'].append(avg10)
        metrics['avg50'].append(avg50)
        metrics['epsilon'].append(agent.epsilon)
        metrics['memory'].append(len(agent.memory))
        metrics['q_values'].append(avg_q)
        metrics['episode_steps'].append(episode_training_steps)
        metrics['episode_time'].append(time.time() - ep_start)
        metrics['since_improv'].append(since_improv)
        
        # Save best model
        if avg10 > best_avg_reward:
            best_avg_reward = avg10
            best_episode = ep
            agent.save(f"{save_prefix}_best_weights.h5")

        # Save best model up to episode 600 for "standard" checkpoint
        if ep <= 600:
            if ep == 1 or avg10 > (metrics.get('best_avg_reward_600', -np.inf)):
                agent.save(f"{save_prefix}_600ep_best_weights.h5")
                metrics['best_avg_reward_600'] = avg10
                metrics['best_episode_600'] = ep
        
        # Logging (matches original style)
        if ep <= 10 or ep % 50 == 0 or ep in [100, 200, 300, 400, 500, 600, 800, 1000]:
            print(f"Episode {ep:3d} | Reward: {total_reward:7.2f} | Avg(10): {avg10:7.2f} | "
                  f"ε: {agent.epsilon:.3f} | Memory: {len(agent.memory):,} ({len(agent.memory)/REPLAY_MEMORY_SIZE:.1%}) | "
                  f"Steps: {episode_training_steps} | Time: {metrics['episode_time'][-1]:.2f}s | "
                  f"Avg Q-val: {avg_q:.2f} | Since Improv: {since_improv}")
    
    env.close()
    training_time = time.time() - start_time
    pd.DataFrame(metrics).to_csv(f"{save_prefix}_metrics.csv", index=False)
    print(f"Training complete. Best avg(10): {best_avg_reward:.2f} at episode {best_episode}")
    return {
        'agent': agent,
        'metrics': metrics,
        'total_time': training_time,
        'best_avg_reward': best_avg_reward,
        'best_episode': best_episode,
        'total_training_steps': total_training_steps,
    }
In [32]:
def compare_training_durations_single_run():
    """Train Double DQN for 1000 episodes and compare metrics at 600 and 1000 episodes."""
    print("\n=== Training Double DQN (1000 episodes) ===")
    results = train_double_dqn_with_metrics(episodes=1000, save_prefix="double_dqn")

    # Simulate "standard" (600) by slicing metrics
    metrics = results['metrics']
    standard_metrics = {
        k: (v[:600] if isinstance(v, (list, np.ndarray)) else v)
        for k, v in metrics.items()
    }
    extended_metrics = metrics

    print("\n=== Evaluating Models ===")
    standard_eval = evaluate_stability("double_dqn_600ep_best_weights.h5", num_episodes=50, num_runs=5)
    extended_eval = evaluate_stability("double_dqn_best_weights.h5", num_episodes=50, num_runs=5)

    print("\n=== Training Duration Comparison Results ===")
    print(f"{'Metric':<30} | {'Standard (600)':<15} | {'Extended (1000)':<15}")
    print("-"*70)
    print(f"{'Best Training Avg Reward':<30} | {np.max(standard_metrics['avg10']):15.2f} | {np.max(extended_metrics['avg10']):15.2f}")
    print(f"{'Final Avg Reward (50)':<30} | {np.mean(standard_metrics['avg50'][-10:]):15.2f} | {np.mean(extended_metrics['avg50'][-10:]):15.2f}")
    print(f"{'Total Training Time (hrs)':<30} | {results['total_time']/3600:15.2f} | {results['total_time']/3600:15.2f}")
    print()
    print("Evaluation Metrics:")
    print(f"{'Mean Reward':<30} | {standard_eval['mean']:15.2f} | {extended_eval['mean']:15.2f}")
    print(f"{'Reward Std':<30} | {standard_eval['std']:15.2f} | {extended_eval['std']:15.2f}")
    print(f"{'Run Consistency':<30} | {standard_eval['run_consistency']:15.2f} | {extended_eval['run_consistency']:15.2f}")

    # Plot learning curves
    plt.figure(figsize=(12, 6))
    plt.plot(standard_metrics['episode'], standard_metrics['avg50'], label='Standard (600 eps)')
    plt.plot(extended_metrics['episode'], extended_metrics['avg50'], label='Extended (1000 eps)')
    plt.xlabel('Episode')
    plt.ylabel('Average Reward (50 eps)')
    plt.title('Double DQN Learning Curve Comparison')
    plt.legend()
    plt.grid()
    plt.savefig('training_duration_comparison.png')
    plt.show()

    return {
        'standard': standard_metrics,
        'extended': extended_metrics,
        'standard_eval': standard_eval,
        'extended_eval': extended_eval
    }
In [33]:
def enhanced_evaluate_stability(weights_path, num_episodes=50, num_runs=5):
    """Enhanced evaluation with confidence intervals and more metrics"""
    import numpy as np
    import matplotlib.pyplot as plt

    results = evaluate_stability(weights_path, num_episodes, num_runs)
    if not results:
        return None

    # Calculate 95% confidence interval
    sem = results['std'] / np.sqrt(len(results['all_rewards']))
    ci_width = 1.96 * sem
    ci_95 = (results['mean'] - ci_width, results['mean'] + ci_width)

    print("\nEnhanced Evaluation Metrics:")
    print(f"95% Confidence Interval: {results['mean']:.2f} ± {ci_width:.2f}")
    print(f"Reward Range: {np.min(results['all_rewards']):.2f} to {np.max(results['all_rewards']):.2f}")

    # Plot reward distribution
    plt.figure(figsize=(10, 5))
    plt.hist(results['all_rewards'], bins=20, color='blue', alpha=0.7)
    plt.axvline(results['mean'], color='r', linestyle='dashed', linewidth=1)
    plt.title(f'Reward Distribution (n={len(results["all_rewards"])})')
    plt.xlabel('Total Reward')
    plt.ylabel('Frequency')
    plt.savefig(f'{weights_path}_reward_dist.png')
    plt.show()

    results['ci_95'] = ci_95
    return results
In [34]:
if __name__ == "__main__":
    # Run the single 1000-episode training and comparison
    comparison_results = compare_training_durations_single_run()
    
    # Enhanced evaluation of best model
    if comparison_results['extended_eval']['mean'] > comparison_results['standard_eval']['mean']:
        best_model = "double_dqn_best_weights.h5"
    else:
        best_model = "double_dqn_600ep_best_weights.h5"
    
    print(f"\nRunning enhanced evaluation on best model: {best_model}")
    enhanced_results = enhanced_evaluate_stability(best_model)
=== Training Double DQN (1000 episodes) ===

Training Double DQN (1000 episodes) with consistent conditions
Random seed: 42
Episode   1 | Reward: -1330.80 | Avg(10): -1330.80 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.02s | Avg Q-val: 0.00 | Since Improv: 1
Episode   2 | Reward: -972.86 | Avg(10): -972.86 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.05s | Avg Q-val: 0.00 | Since Improv: 1
Episode   3 | Reward: -1701.14 | Avg(10): -1701.14 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.02s | Avg Q-val: 0.00 | Since Improv: 1
Episode   4 | Reward: -888.86 | Avg(10): -888.86 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.03s | Avg Q-val: 0.00 | Since Improv: 2
Episode   5 | Reward: -978.90 | Avg(10): -978.90 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.03s | Avg Q-val: 0.00 | Since Improv: 1
Episode   6 | Reward: -1263.51 | Avg(10): -1263.51 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.03s | Avg Q-val: 0.00 | Since Improv: 2
Episode   7 | Reward: -1757.85 | Avg(10): -1757.85 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.03s | Avg Q-val: 0.00 | Since Improv: 3
Episode   8 | Reward: -1299.19 | Avg(10): -1299.19 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.03s | Avg Q-val: 0.00 | Since Improv: 4
Episode   9 | Reward: -1489.04 | Avg(10): -1489.04 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.05s | Avg Q-val: 0.00 | Since Improv: 5
Episode  10 | Reward: -1538.23 | Avg(10): -1322.04 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.14s | Avg Q-val: -0.07 | Since Improv: 6
Episode  50 | Reward: -764.44 | Avg(10): -1183.24 | ε: 0.778 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 7.34s | Avg Q-val: -44.60 | Since Improv: 46
Episode 100 | Reward: -899.69 | Avg(10): -1052.50 | ε: 0.606 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 7.52s | Avg Q-val: -76.56 | Since Improv: 96
Episode 150 | Reward: -982.70 | Avg(10): -981.62 | ε: 0.471 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 7.60s | Avg Q-val: -99.02 | Since Improv: 146
Episode 200 | Reward: -368.59 | Avg(10): -631.55 | ε: 0.367 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 8.06s | Avg Q-val: -119.93 | Since Improv: 25
Episode 250 | Reward: -381.00 | Avg(10): -437.74 | ε: 0.286 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 7.94s | Avg Q-val: -133.93 | Since Improv: 8
Episode 300 | Reward: -363.70 | Avg(10): -223.77 | ε: 0.222 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 10.35s | Avg Q-val: -130.62 | Since Improv: 3
Episode 350 | Reward: -383.50 | Avg(10): -305.65 | ε: 0.173 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 9.88s | Avg Q-val: -121.50 | Since Improv: 9
Episode 400 | Reward: -244.90 | Avg(10): -207.67 | ε: 0.135 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 10.66s | Avg Q-val: -109.77 | Since Improv: 14
Episode 450 | Reward: -123.09 | Avg(10): -133.71 | ε: 0.105 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 9.85s | Avg Q-val: -88.51 | Since Improv: 37
Episode 500 | Reward: -124.29 | Avg(10): -180.17 | ε: 0.082 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 10.20s | Avg Q-val: -70.92 | Since Improv: 87
Episode 550 | Reward: -348.63 | Avg(10): -177.86 | ε: 0.063 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 10.33s | Avg Q-val: -44.97 | Since Improv: 137
Episode 600 | Reward:   -1.75 | Avg(10): -129.95 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 9.61s | Avg Q-val: -23.39 | Since Improv: 187
Episode 650 | Reward: -124.82 | Avg(10): -131.28 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 10.07s | Avg Q-val: -6.43 | Since Improv: 237
Episode 700 | Reward: -118.18 | Avg(10): -169.66 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 8.15s | Avg Q-val: 6.49 | Since Improv: 287
Episode 750 | Reward: -121.13 | Avg(10): -184.77 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 8.44s | Avg Q-val: 11.65 | Since Improv: 337
Episode 800 | Reward: -233.79 | Avg(10): -164.00 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 8.10s | Avg Q-val: 13.49 | Since Improv: 387
Episode 850 | Reward:   -1.14 | Avg(10): -186.02 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 8.28s | Avg Q-val: 11.38 | Since Improv: 437
Episode 900 | Reward: -117.05 | Avg(10): -144.90 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 9.94s | Avg Q-val: 11.17 | Since Improv: 487
Episode 950 | Reward: -118.95 | Avg(10): -199.69 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 9.98s | Avg Q-val: 9.71 | Since Improv: 537
Episode 1000 | Reward: -122.41 | Avg(10): -235.46 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 8.35s | Avg Q-val: 7.19 | Since Improv: 587
Training complete. Best avg(10): -75.30 at episode 413

=== Evaluating Models ===
Evaluating: double_dqn_600ep_best_weights.h5
Loaded weights from double_dqn_600ep_best_weights.h5
Running 5 runs × 50 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -193.2 ± 103.7
--- Run 2/5 ---
Run 2: -173.6 ± 123.9
--- Run 3/5 ---
Run 3: -179.3 ± 101.7
--- Run 4/5 ---
Run 4: -173.7 ± 106.9
--- Run 5/5 ---
Run 5: -166.1 ± 92.3

EVALUATION SUMMARY:
Overall mean: -177.17
Overall std: 106.60
Run-to-run consistency: 9.05 (lower = more consistent)
--------------------------------------------------
Evaluating: double_dqn_best_weights.h5
Loaded weights from double_dqn_best_weights.h5
Running 5 runs × 50 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -173.3 ± 85.0
--- Run 2/5 ---
Run 2: -156.2 ± 103.2
--- Run 3/5 ---
Run 3: -174.4 ± 89.7
--- Run 4/5 ---
Run 4: -168.6 ± 94.4
--- Run 5/5 ---
Run 5: -170.1 ± 100.0

EVALUATION SUMMARY:
Overall mean: -168.51
Overall std: 94.92
Run-to-run consistency: 6.51 (lower = more consistent)
--------------------------------------------------

=== Training Duration Comparison Results ===
Metric                         | Standard (600)  | Extended (1000)
----------------------------------------------------------------------
Best Training Avg Reward       |          -75.30 |          -75.30
Final Avg Reward (50)          |         -174.57 |         -182.83
Total Training Time (hrs)      |            2.51 |            2.51

Evaluation Metrics:
Mean Reward                    |         -177.17 |         -168.51
Reward Std                     |          106.60 |           94.92
Run Consistency                |            9.05 |            6.51
No description has been provided for this image
Running enhanced evaluation on best model: double_dqn_best_weights.h5
Evaluating: double_dqn_best_weights.h5
Loaded weights from double_dqn_best_weights.h5
Running 5 runs × 50 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -181.9 ± 103.0
--- Run 2/5 ---
Run 2: -167.9 ± 102.5
--- Run 3/5 ---
Run 3: -169.2 ± 97.2
--- Run 4/5 ---
Run 4: -155.5 ± 92.5
--- Run 5/5 ---
Run 5: -165.2 ± 98.9

EVALUATION SUMMARY:
Overall mean: -167.93
Overall std: 99.27
Run-to-run consistency: 8.48 (lower = more consistent)
--------------------------------------------------

Enhanced Evaluation Metrics:
95% Confidence Interval: -167.93 ± 12.31
Reward Range: -448.29 to -1.17
No description has been provided for this image

Observations and analysis

  1. Learning Curve (Image)
  • The average reward improves steadily and significantly over the first ~400 episodes, then plateaus and stabilizes from about episode 500 onwards.
  • After episode 600, the curve remains quite stable, with no further dramatic improvement but also no instability.
  • There are no signs of overfitting or collapse; performance remains strong throughout the last 400 episodes.
  1. Quantitative Metrics
  • Best Training Avg(10): -75.30 at episode 413 (i.e., best rolling 10-episode average reward achieved before episode 600).
  • Final Avg Reward (50):
    • At 600 episodes: -174.57
    • At 1000 episodes: -182.83
    • (Shows stability, not much difference between 600 and 1000, but not further improvement.)
  • Evaluation Mean Reward:
    • 600-ep model: -177.17 ± 106.60
    • 1000-ep model: -168.51 ± 94.92
    • Enhanced evaluation (1000ep): -167.93 ± 99.27 (with 95% CI ≈ 12.31)
  • Run-to-run consistency:
    • 600ep: 9.05
    • 1000ep: 6.51
    • (Lower is better: 1000ep model is a bit more consistent.)
  1. Interpretation
  • The agent learns rapidly in the first few hundred episodes, and the major gains are made before episode 500.
  • There is no meaningful improvement in max rolling average reward after 600 episodes. The best performance is still at episode 413.
  • The 1000-episode model is slightly more stable and consistent (lower std and run-to-run consistency), but not significantly better in mean reward.
  • Extended training does not hurt performance; it seems to slightly stabilize it.

From the above, we will continue to use 1000 episodes

  1. Stable Baseline:
  • My learning curve and results show that training stabilizes by episode 600, and running to 1000 gives me stable, robust results with slightly better consistency and lower variance.
  1. Fair Comparison:
  • Using 1000 episodes for all subsequent experiments (gradient clipping, soft target updates, etc.) ensures a fair apples-to-apples comparison. I want to compare each new technique to my “best effort baseline,” not a shorter or less stable run.
  1. Detects Subtle Gains:
  • Improvements from techniques like gradient clipping or soft target updates may be subtle, often seen in stability, final performance, or robustness to random seeds.

  • Longer training (1000 episodes) gives these methods room to show benefit and avoids misleading results from early plateaus or noise.

  1. No Downside:
  • I have confirmed that running to 1000 episodes does not cause overfitting or performance drop. There’s no significant extra cost, and you get more reliable statistics.

Model Architecture¶

What is Dueling DQN?

  • Dueling DQN is not an entirely new algorithm but rather a change to the network architecture of a DQN agent. It's a method to improve the way the Q-function is learned.

  • Instead of having a single neural network output the Q-values for all actions, a Dueling network splits the final layers into two separate streams:

    • Value Stream (V(s)): This stream estimates the intrinsic value of being in a particular state s, regardless of which action is taken. It tells you how good a state is overall.

    • Advantage Stream (A(s,a)): This stream estimates the advantage of taking a specific action a in a given state s. It tells you how much better or worse that action is compared to the other actions available in that state

Why is this a valid next step ?

  • Double DQN solves for a stable agent, my next challenge was to improve its learning efficiency and performance. The agent was reliable, but its average reward was still not as high as I'd hoped, and there was a clear gap between its best training performance and its evaluation performance. I needed a way for the network to learn more effectively from its experiences.

  • That is where dueling DQN comes in. Thus architecture separates the estimation of a state's value from the advantage of each action. This is a powerful technique because it allows the network to learn about the 'goodness' of a state independently from the specific actions taken within it. It improves generalization and can lead to faster and more robust learning

In [73]:
# Set seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

def action_index_to_torque(action_index, n_actions):
    """Convert action index to torque value"""
    return -2.0 + (action_index * 4.0) / (n_actions - 1)

class DuelingDQNAgent:
    def __init__(self, input_shape, n_actions, gamma, replay_memory_size, min_replay_memory,
                 batch_size, target_update_every, learning_rate, epsilon_start, epsilon_min, 
                 epsilon_decay, epsilon_strategy="linear"):
        
        self.input_shape = input_shape
        self.n_actions = n_actions
        self.gamma = gamma
        self.replay_memory_size = replay_memory_size
        self.min_replay_memory = min_replay_memory
        self.batch_size = batch_size
        self.target_update_every = target_update_every
        self.learning_rate = learning_rate
        self.epsilon = epsilon_start
        self.epsilon_start = epsilon_start
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        self.epsilon_strategy = epsilon_strategy
        
        self.memory = deque(maxlen=replay_memory_size)
        self.target_update_counter = 0
        self.performance_history = deque(maxlen=50)
        self.last_improvement_episode = 0
        self.plateau_threshold = 20
        self.q_values_history = deque(maxlen=1000)
        
        # Build networks
        self.main_network = self._build_network()
        self.target_network = self._build_network()
        self.update_target()
        
        # Optimizer
        self.optimizer = Adam(learning_rate=learning_rate)
    
    def _build_network(self):
        inputs = Input(shape=(self.input_shape,))
        
        # Common feature extraction
        x = Dense(64, activation='relu')(inputs)
        x = Dense(64, activation='relu')(x)
        
        # Dueling architecture streams
        # Value stream - how good is the state
        value_stream = Dense(32, activation='relu')(x)
        value = Dense(1, activation='linear')(value_stream)
        
        # Advantage stream - how good is each action
        advantage_stream = Dense(32, activation='relu')(x)
        advantage = Dense(self.n_actions, activation='linear')(advantage_stream)
        
        # Combine streams using dueling formula
        outputs = value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))
        
        return Model(inputs=inputs, outputs=outputs)
    
    def select_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(0, self.n_actions)
        q_values = self.main_network(state.reshape(1, -1))
        return np.argmax(q_values[0])
    
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
    
    def train_step(self):
        if len(self.memory) < self.min_replay_memory:
            return
        
        batch = random.sample(self.memory, self.batch_size)
        states = np.array([transition[0] for transition in batch])
        actions = np.array([transition[1] for transition in batch])
        rewards = np.array([transition[2] for transition in batch])
        next_states = np.array([transition[3] for transition in batch])
        dones = np.array([transition[4] for transition in batch])
        
        # DOUBLE DQN: Use main network to select, target network to evaluate
        next_q_values_main = self.main_network(next_states)
        best_actions = tf.argmax(next_q_values_main, axis=1)  # Keep as tensor
        next_q_values_target = self.target_network(next_states)
        
        # Use tf.gather instead of numpy indexing
        batch_indices = tf.range(self.batch_size, dtype=tf.int32)
        indices = tf.stack([batch_indices, tf.cast(best_actions, tf.int32)], axis=1)
        max_target_q_values = tf.gather_nd(next_q_values_target, indices)
        
        targets = rewards + (self.gamma * max_target_q_values * (1 - dones))
        
        with tf.GradientTape() as tape:
            q_values = self.main_network(states, training=True)
            q_values_for_actions = tf.reduce_sum(q_values * tf.one_hot(actions, self.n_actions), axis=1)
            loss = tf.reduce_mean(tf.square(targets - q_values_for_actions))
        
        gradients = tape.gradient(loss, self.main_network.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.main_network.trainable_variables))
        
        # Track Q-values for analysis
        self.q_values_history.append(float(tf.reduce_mean(q_values)))
    
    def update_target(self):
        self.target_network.set_weights(self.main_network.get_weights())
    
    def adaptive_epsilon_decay(self, episode, recent_performance):
        """Adaptive epsilon based on learning progress"""
        
        if self.epsilon_strategy == "linear":
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
            
        elif self.epsilon_strategy == "performance_based":
            self.performance_history.append(recent_performance)
            
            if len(self.performance_history) >= 20:
                recent_avg = np.mean(list(self.performance_history)[-10:])
                older_avg = np.mean(list(self.performance_history)[-20:-10])
                
                if recent_avg > older_avg + 5:
                    decay_rate = 0.998
                    self.last_improvement_episode = episode
                else:
                    decay_rate = 0.992
                    
                return max(self.epsilon_min, self.epsilon * decay_rate)
            else:
                return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
                
        elif self.epsilon_strategy == "plateau_restart":
            self.performance_history.append(recent_performance)
            
            if len(self.performance_history) >= 20:
                recent_avg = np.mean(list(self.performance_history)[-10:])
                older_avg = np.mean(list(self.performance_history)[-20:-10])
                
                if recent_avg > older_avg + 5:
                    self.last_improvement_episode = episode
                
                episodes_since_improvement = episode - self.last_improvement_episode
                if episodes_since_improvement >= self.plateau_threshold:
                    print(f"Epsilon restart at episode {episode}: {self.epsilon:.3f} → {self.epsilon_start * 0.3:.3f}")
                    self.epsilon = self.epsilon_start * 0.3
                    self.last_improvement_episode = episode
                    return self.epsilon
                    
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
            
        elif self.epsilon_strategy == "high_exploration":
            epsilon_min_high = 0.15
            return max(epsilon_min_high, self.epsilon * 0.9995)
            
        else:
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
    
    def decay_epsilon_advanced(self, episode, recent_performance):
        """Advanced epsilon decay with strategy-specific logic"""
        self.epsilon = self.adaptive_epsilon_decay(episode, recent_performance)
    
    def save(self, filepath):
        self.main_network.save_weights(filepath)
    
    def load(self, filepath):
        self.main_network.load_weights(filepath)
        self.update_target()
    
    def summary(self):
        self.main_network.summary()
In [75]:
def train_dueling_dqn_with_metrics(episodes=1000, save_prefix="dueling_dqn"):
    """Train Dueling DQN with full metric logging and reproducibility"""
    
    ENV_NAME = 'Pendulum-v0'
    INPUT_SHAPE = 3
    N_ACTIONS = 21
    MAX_STEPS = 200
    REPLAY_MEMORY_SIZE = 100000
    MIN_REPLAY_MEMORY = 2000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    GAMMA = 0.995
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"
    
    print(f"\n=== Training Dueling DQN ({episodes} episodes) ===")
    print(f"Random seed: {SEED}")
    
    env = gym.make(ENV_NAME)
    # For reproducibility across Gym versions
    try:
        env.reset(seed=SEED)
    except TypeError:
        pass  # For older Gym versions
    
    agent = DuelingDQNAgent(
        INPUT_SHAPE, N_ACTIONS, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    metrics = {
        'episode': [],
        'reward': [],
        'avg10': [],
        'avg50': [],
        'epsilon': [],
        'memory': [],
        'q_values': [],
        'episode_steps': [],
        'episode_time': [],
        'since_improv': [],
    }
    best_avg_reward = -np.inf
    best_episode = 0
    total_training_steps = 0
    start_time = time.time()
    
    for ep in range(1, episodes + 1):
        ep_start = time.time()
        # Version-agnostic reset
        try:
            s = env.reset(seed=SEED + ep)
            if isinstance(s, tuple):  # Newer gym/gymnasium
                s = s[0]
        except TypeError:
            env.seed(SEED + ep)
            s = env.reset()
        s = np.asarray(s, dtype=np.float32).flatten()[:3]
        total_reward = 0
        episode_training_steps = 0

        for t in range(MAX_STEPS):
            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, N_ACTIONS)
            s_next, r, done, *info = env.step([torque])
            s_next = s_next[0] if isinstance(s_next, tuple) else s_next
            s_next = np.asarray(s_next, dtype=np.float32).flatten()[:3]
            
            agent.remember(s, a_idx, r, s_next, done)
            
            if len(agent.memory) >= MIN_REPLAY_MEMORY:
                agent.train_step()
                total_training_steps += 1
                episode_training_steps += 1
            
            s = s_next
            total_reward += r
            if done:
                break

        # Epsilon update and target network update
        agent.performance_history.append(total_reward)
        recent_performance = np.mean(list(agent.performance_history)[-10:]) if len(agent.performance_history) >= 10 else total_reward
        agent.decay_epsilon_advanced(ep, recent_performance)
        if ep % TARGET_UPDATE_EVERY == 0:
            agent.update_target()
        
        # Rolling averages
        avg10 = np.mean(metrics['reward'][-9:] + [total_reward]) if len(metrics['reward']) >= 9 else np.mean([total_reward])
        avg50 = np.mean(metrics['reward'][-49:] + [total_reward]) if len(metrics['reward']) >= 49 else np.mean([total_reward])
        avg_q = np.mean(list(agent.q_values_history)[-100:]) if agent.q_values_history else 0
        since_improv = ep - best_episode
        
        # Save metrics
        metrics['episode'].append(ep)
        metrics['reward'].append(total_reward)
        metrics['avg10'].append(avg10)
        metrics['avg50'].append(avg50)
        metrics['epsilon'].append(agent.epsilon)
        metrics['memory'].append(len(agent.memory))
        metrics['q_values'].append(avg_q)
        metrics['episode_steps'].append(episode_training_steps)
        metrics['episode_time'].append(time.time() - ep_start)
        metrics['since_improv'].append(since_improv)
        
        # Save best model
        if avg10 > best_avg_reward:
            best_avg_reward = avg10
            best_episode = ep
            agent.save(f"{save_prefix}_best_weights.h5")

        # Save best model up to episode 600 for "standard" checkpoint
        if ep <= 600:
            if ep == 1 or avg10 > (metrics.get('best_avg_reward_600', -np.inf)):
                agent.save(f"{save_prefix}_600ep_best_weights.h5")
                metrics['best_avg_reward_600'] = avg10
                metrics['best_episode_600'] = ep
        
        # Logging (matches original style)
        if ep <= 10 or ep % 50 == 0 or ep in [100, 200, 300, 400, 500, 600, 800, 1000]:
            print(f"Episode {ep:3d} | Reward: {total_reward:7.2f} | Avg(10): {avg10:7.2f} | "
                  f"ε: {agent.epsilon:.3f} | Memory: {len(agent.memory):,} ({len(agent.memory)/REPLAY_MEMORY_SIZE:.1%}) | "
                  f"Steps: {episode_training_steps} | Time: {metrics['episode_time'][-1]:.2f}s | "
                  f"Avg Q-val: {avg_q:.2f} | Since Improv: {since_improv}")
    
    env.close()
    training_time = time.time() - start_time
    pd.DataFrame(metrics).to_csv(f"{save_prefix}_metrics.csv", index=False)
    print(f"Training complete. Best avg(10): {best_avg_reward:.2f} at episode {best_episode}")
    return {
        'agent': agent,
        'metrics': metrics,
        'total_time': training_time,
        'best_avg_reward': best_avg_reward,
        'best_episode': best_episode,
        'total_training_steps': total_training_steps,
    }

def evaluate_stability(weights_path, num_episodes=50, num_runs=5):
    """Evaluate model stability across multiple runs"""
    
    INPUT_SHAPE = 3
    N_ACTIONS = 21
    MAX_STEPS = 200
    
    # Use optimized hyperparameters for agent creation
    REPLAY_MEMORY_SIZE = 100000
    MIN_REPLAY_MEMORY = 2000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    GAMMA = 0.995
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"
    
    print(f"\nEvaluating: {weights_path}")
    
    # Create agent with same config as training
    agent = DuelingDQNAgent(
        INPUT_SHAPE, N_ACTIONS, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    try:
        agent.load(weights_path)
        agent.epsilon = 0.0  # Pure exploitation for evaluation
        print(f"Loaded weights from {weights_path}")
    except FileNotFoundError:
        print(f"ERROR: Weights file {weights_path} not found!")
        return None
    except Exception as e:
        print(f"ERROR loading weights: {e}")
        return None
    
    print(f"Running {num_runs} runs × {num_episodes} episodes (epsilon=0.0)")
    
    all_run_results = []
    all_rewards = []
    
    for run in range(num_runs):
        print(f"--- Run {run+1}/{num_runs} ---")
        env = gym.make('Pendulum-v0')
        run_rewards = []
        
        for ep in range(num_episodes):
            state = env.reset()
            if isinstance(state, tuple):
                state = state[0]
            state = np.array(state, dtype=np.float32)
            if state.shape != (3,):
                state = state.flatten()[:3]
            
            total_reward = 0
            
            for t in range(MAX_STEPS):
                a_idx = agent.select_action(state)
                torque = action_index_to_torque(a_idx, N_ACTIONS)
                
                next_state, reward, done, info = env.step([torque])
                
                if isinstance(next_state, tuple):
                    next_state = next_state[0]
                next_state = np.array(next_state, dtype=np.float32)
                if next_state.shape != (3,):
                    next_state = next_state.flatten()[:3]
                
                total_reward += reward
                state = next_state
                
                if done:
                    break
            
            run_rewards.append(total_reward)
        
        env.close()
        
        run_mean = np.mean(run_rewards)
        run_std = np.std(run_rewards)
        all_run_results.append({
            'mean': run_mean,
            'std': run_std,
            'rewards': run_rewards
        })
        all_rewards.extend(run_rewards)
        
        print(f"Run {run+1}: {run_mean:.1f} ± {run_std:.1f}")
    
    # Calculate overall statistics
    all_means = [run['mean'] for run in all_run_results]
    overall_mean = np.mean(all_rewards)
    overall_std = np.std(all_rewards)
    run_consistency = np.std(all_means)
    
    print(f"\nEVALUATION SUMMARY:")
    print(f"Overall mean: {overall_mean:.2f}")
    print(f"Overall std: {overall_std:.2f}")
    print(f"Run-to-run consistency: {run_consistency:.2f} (lower = more consistent)")
    print("-" * 50)
    
    return {
        'mean': overall_mean,
        'std': overall_std,
        'run_consistency': run_consistency,
        'all_rewards': all_rewards,
        'num_runs': num_runs,
        'num_episodes': num_episodes
    }
In [76]:
def enhanced_evaluate_stability(weights_path, num_episodes=50, num_runs=5):
    """Enhanced evaluation with confidence intervals and more metrics"""
    results = evaluate_stability(weights_path, num_episodes, num_runs)
    if not results:
        return None

    # Calculate 95% confidence interval
    sem = results['std'] / np.sqrt(len(results['all_rewards']))
    ci_width = 1.96 * sem
    ci_95 = (results['mean'] - ci_width, results['mean'] + ci_width)

    print("\nEnhanced Evaluation Metrics:")
    print(f"95% Confidence Interval: {results['mean']:.2f} ± {ci_width:.2f}")
    print(f"Reward Range: {np.min(results['all_rewards']):.2f} to {np.max(results['all_rewards']):.2f}")

    # Plot reward distribution
    plt.figure(figsize=(10, 5))
    plt.hist(results['all_rewards'], bins=20, color='blue', alpha=0.7)
    plt.axvline(results['mean'], color='r', linestyle='dashed', linewidth=1)
    plt.title(f'Reward Distribution (n={len(results["all_rewards"])})')
    plt.xlabel('Total Reward')
    plt.ylabel('Frequency')
    plt.savefig(f'{weights_path}_reward_dist.png')
    plt.show()

    results['ci_95'] = ci_95
    return results
In [77]:
def compare_dueling_training_durations():
    """Train Dueling DQN for 1000 episodes and compare metrics at 600 and 1000 episodes."""
    print("\n=== Training Dueling DQN (1000 episodes) ===")
    results = train_dueling_dqn_with_metrics(episodes=1000, save_prefix="dueling_dqn")

    # Simulate "standard" (600) by slicing metrics
    metrics = results['metrics']
    standard_metrics = {
        k: (v[:600] if isinstance(v, (list, np.ndarray)) else v)
        for k, v in metrics.items()
    }
    extended_metrics = metrics

    print("\n=== Evaluating Models ===")
    standard_eval = evaluate_stability("dueling_dqn_600ep_best_weights.h5", num_episodes=50, num_runs=5)
    extended_eval = evaluate_stability("dueling_dqn_best_weights.h5", num_episodes=50, num_runs=5)

    print("\n=== Training Duration Comparison Results ===")
    print(f"{'Metric':<30} | {'Standard (600)':<15} | {'Extended (1000)':<15}")
    print("-"*70)
    print(f"{'Best Training Avg Reward':<30} | {np.max(standard_metrics['avg10']):15.2f} | {np.max(extended_metrics['avg10']):15.2f}")
    print(f"{'Final Avg Reward (50)':<30} | {np.mean(standard_metrics['avg50'][-10:]):15.2f} | {np.mean(extended_metrics['avg50'][-10:]):15.2f}")
    print(f"{'Total Training Time (hrs)':<30} | {results['total_time']/3600:15.2f} | {results['total_time']/3600:15.2f}")
    print()
    print("Evaluation Metrics:")
    print(f"{'Mean Reward':<30} | {standard_eval['mean']:15.2f} | {extended_eval['mean']:15.2f}")
    print(f"{'Reward Std':<30} | {standard_eval['std']:15.2f} | {extended_eval['std']:15.2f}")
    print(f"{'Run Consistency':<30} | {standard_eval['run_consistency']:15.2f} | {extended_eval['run_consistency']:15.2f}")

    # Plot learning curves
    plt.figure(figsize=(12, 6))
    plt.plot(standard_metrics['episode'], standard_metrics['avg50'], label='Standard (600 eps)')
    plt.plot(extended_metrics['episode'], extended_metrics['avg50'], label='Extended (1000 eps)')
    plt.xlabel('Episode')
    plt.ylabel('Average Reward (50 eps)')
    plt.title('Dueling DQN Learning Curve Comparison')
    plt.legend()
    plt.grid()
    plt.savefig('dueling_training_duration_comparison.png')
    plt.show()

    return {
        'standard': standard_metrics,
        'extended': extended_metrics,
        'standard_eval': standard_eval,
        'extended_eval': extended_eval
    }
In [78]:
if __name__ == "__main__":
    # Run complete Dueling DQN experiment
    dueling_results = compare_dueling_training_durations()
    
    # Enhanced evaluation of best model
    if dueling_results['extended_eval']['mean'] > dueling_results['standard_eval']['mean']:
        best_model = "dueling_dqn_best_weights.h5"
    else:
        best_model = "dueling_dqn_600ep_best_weights.h5"
    
    print(f"\nRunning enhanced evaluation on best model: {best_model}")
    enhanced_results = enhanced_evaluate_stability(best_model)
=== Training Dueling DQN (1000 episodes) ===

=== Training Dueling DQN (1000 episodes) ===
Random seed: 42
Episode   1 | Reward: -1330.80 | Avg(10): -1330.80 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.01s | Avg Q-val: 0.00 | Since Improv: 1
Episode   2 | Reward: -971.21 | Avg(10): -971.21 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.04s | Avg Q-val: 0.00 | Since Improv: 1
Episode   3 | Reward: -1701.14 | Avg(10): -1701.14 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.02s | Avg Q-val: 0.00 | Since Improv: 1
Episode   4 | Reward: -949.81 | Avg(10): -949.81 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.03s | Avg Q-val: 0.00 | Since Improv: 2
Episode   5 | Reward: -1000.46 | Avg(10): -1000.46 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.06s | Avg Q-val: 0.00 | Since Improv: 1
Episode   6 | Reward: -1239.04 | Avg(10): -1239.04 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.04s | Avg Q-val: 0.00 | Since Improv: 2
Episode   7 | Reward: -1755.29 | Avg(10): -1755.29 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.08s | Avg Q-val: 0.00 | Since Improv: 3
Episode   8 | Reward: -1308.09 | Avg(10): -1308.09 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.09s | Avg Q-val: 0.00 | Since Improv: 4
Episode   9 | Reward: -1476.96 | Avg(10): -1476.96 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.09s | Avg Q-val: 0.00 | Since Improv: 5
Episode  10 | Reward: -1538.53 | Avg(10): -1327.13 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.29s | Avg Q-val: 0.31 | Since Improv: 6
Episode  50 | Reward: -809.25 | Avg(10): -1181.30 | ε: 0.778 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 10.48s | Avg Q-val: -44.08 | Since Improv: 46
Episode 100 | Reward: -724.07 | Avg(10): -776.74 | ε: 0.606 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 10.97s | Avg Q-val: -70.65 | Since Improv: 11
Episode 150 | Reward: -505.99 | Avg(10): -497.85 | ε: 0.471 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 10.79s | Avg Q-val: -66.29 | Since Improv: 8
Episode 200 | Reward: -483.03 | Avg(10): -343.67 | ε: 0.367 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 13.48s | Avg Q-val: -45.13 | Since Improv: 1
Episode 250 | Reward: -427.75 | Avg(10): -516.58 | ε: 0.286 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 9.83s | Avg Q-val: -26.78 | Since Improv: 48
Episode 300 | Reward: -610.31 | Avg(10): -442.43 | ε: 0.222 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 10.35s | Avg Q-val: -10.52 | Since Improv: 98
Episode 350 | Reward: -502.95 | Avg(10): -451.21 | ε: 0.173 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 11.29s | Avg Q-val: -5.18 | Since Improv: 148
Episode 400 | Reward: -129.78 | Avg(10): -322.77 | ε: 0.135 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 15.18s | Avg Q-val: -5.74 | Since Improv: 198
Episode 450 | Reward: -504.94 | Avg(10): -403.84 | ε: 0.105 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 17.41s | Avg Q-val: -4.41 | Since Improv: 48
Episode 500 | Reward: -254.83 | Avg(10): -411.17 | ε: 0.082 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 15.65s | Avg Q-val: -5.39 | Since Improv: 98
Episode 550 | Reward: -589.42 | Avg(10): -330.37 | ε: 0.063 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 16.64s | Avg Q-val: 4.58 | Since Improv: 3
Episode 600 | Reward: -140.56 | Avg(10): -329.30 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 15.51s | Avg Q-val: 12.27 | Since Improv: 53
Episode 650 | Reward: -505.31 | Avg(10): -237.57 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 18.78s | Avg Q-val: 14.49 | Since Improv: 6
Episode 700 | Reward: -502.85 | Avg(10): -347.60 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 18.73s | Avg Q-val: 12.89 | Since Improv: 56
Episode 750 | Reward: -133.16 | Avg(10): -309.45 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 17.28s | Avg Q-val: 11.23 | Since Improv: 106
Episode 800 | Reward: -249.08 | Avg(10): -189.46 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 16.87s | Avg Q-val: 12.15 | Since Improv: 5
Episode 850 | Reward: -137.63 | Avg(10): -271.10 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 16.32s | Avg Q-val: 10.14 | Since Improv: 42
Episode 900 | Reward: -128.77 | Avg(10): -199.51 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 16.13s | Avg Q-val: 11.34 | Since Improv: 92
Episode 950 | Reward: -122.72 | Avg(10): -295.31 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 12.69s | Avg Q-val: 11.20 | Since Improv: 142
Episode 1000 | Reward: -385.28 | Avg(10): -450.82 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 13.78s | Avg Q-val: 9.20 | Since Improv: 192
Training complete. Best avg(10): -125.91 at episode 808

=== Evaluating Models ===

Evaluating: dueling_dqn_600ep_best_weights.h5
Loaded weights from dueling_dqn_600ep_best_weights.h5
Running 5 runs × 50 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -176.0 ± 120.5
--- Run 2/5 ---
Run 2: -159.6 ± 78.9
--- Run 3/5 ---
Run 3: -159.6 ± 98.2
--- Run 4/5 ---
Run 4: -164.6 ± 75.2
--- Run 5/5 ---
Run 5: -177.2 ± 93.9

EVALUATION SUMMARY:
Overall mean: -167.38
Overall std: 95.04
Run-to-run consistency: 7.74 (lower = more consistent)
--------------------------------------------------

Evaluating: dueling_dqn_best_weights.h5
Loaded weights from dueling_dqn_best_weights.h5
Running 5 runs × 50 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -132.4 ± 91.4
--- Run 2/5 ---
Run 2: -117.3 ± 88.1
--- Run 3/5 ---
Run 3: -179.6 ± 98.9
--- Run 4/5 ---
Run 4: -157.5 ± 83.0
--- Run 5/5 ---
Run 5: -158.7 ± 83.4

EVALUATION SUMMARY:
Overall mean: -149.10
Overall std: 91.81
Run-to-run consistency: 21.82 (lower = more consistent)
--------------------------------------------------

=== Training Duration Comparison Results ===
Metric                         | Standard (600)  | Extended (1000)
----------------------------------------------------------------------
Best Training Avg Reward       |         -274.61 |         -125.91
Final Avg Reward (50)          |         -353.83 |         -287.46
Total Training Time (hrs)      |            3.99 |            3.99

Evaluation Metrics:
Mean Reward                    |         -167.38 |         -149.10
Reward Std                     |           95.04 |           91.81
Run Consistency                |            7.74 |           21.82
No description has been provided for this image
Running enhanced evaluation on best model: dueling_dqn_best_weights.h5

Evaluating: dueling_dqn_best_weights.h5
Loaded weights from dueling_dqn_best_weights.h5
Running 5 runs × 50 episodes (epsilon=0.0)
--- Run 1/5 ---
Run 1: -156.3 ± 89.5
--- Run 2/5 ---
Run 2: -137.7 ± 77.2
--- Run 3/5 ---
Run 3: -141.0 ± 70.8
--- Run 4/5 ---
Run 4: -170.5 ± 88.7
--- Run 5/5 ---
Run 5: -154.8 ± 87.1

EVALUATION SUMMARY:
Overall mean: -152.05
Overall std: 83.81
Run-to-run consistency: 11.78 (lower = more consistent)
--------------------------------------------------

Enhanced Evaluation Metrics:
95% Confidence Interval: -152.05 ± 10.39
Reward Range: -408.10 to -1.34
No description has been provided for this image

Observations and analysis

  • Observations

    • Training Performance:

      • The agent's performance improved, but not consistently. The best average reward of -125.91 was achieved at episode 808, but the final average reward dropped significantly to -450.82, indicating a performance decay late in training.

      • The training log shows periods of improvement followed by plateaus or drops in performance. For example, the Since Improv count reaches 192, suggesting a long period without significant gains.

  • Evaluation Performance:

      - The overall mean reward of -149.10 is an improvement over your previous Double DQN model (-168.51), suggesting the dueling architecture helps the agent find a better policy on average.
    
      - The overall standard deviation of 91.81 is lower than the previous model's 94.92, indicating slightly more consistent performance within each evaluation run.
    
    
    

  • Stability (Run-to-Run Consistency):

    • The run-to-run consistency score of 21.82 is significantly worse than the Double DQN's score of 6.51. This is a major finding. While the average performance is better, the Dueling DQN is much less predictable across different evaluation runs. The range of mean rewards from -117.3 (Run 2) to -179.6 (Run 3) is wide.
In [101]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from collections import defaultdict

def compare_models(double_dqn_path, dueling_dqn_path, num_episodes=100, num_runs=5):
    """Compare Double DQN and Dueling DQN models with comprehensive analysis"""
    
    # Hyperparameters (must match training config)
    INPUT_SHAPE = 3
    N_ACTIONS = 21
    GAMMA = 0.995
    REPLAY_MEMORY_SIZE = 100000
    MIN_REPLAY_MEMORY = 2000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"
    
    print("\n=== MODEL COMPARISON ===")
    print(f"Double DQN: {double_dqn_path}")
    print(f"Dueling DQN: {dueling_dqn_path}")
    print(f"Evaluation episodes: {num_runs}x{num_episodes}")
    print("="*50)
    
    # Initialize agents
    double_agent = DoubleDQNAgent(
        INPUT_SHAPE, N_ACTIONS, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY,
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START,
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    dueling_agent = DuelingDQNAgent(
        INPUT_SHAPE, N_ACTIONS, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY,
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START,
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    # Load weights
    try:
        double_agent.load(double_dqn_path)
        double_agent.epsilon = 0.0  # Disable exploration
        print(f"Loaded Double DQN weights from {double_dqn_path}")
    except Exception as e:
        print(f"Error loading Double DQN: {e}")
        return None
        
    try:
        dueling_agent.load(dueling_dqn_path)
        dueling_agent.epsilon = 0.0  # Disable exploration
        print(f"Loaded Dueling DQN weights from {dueling_dqn_path}")
    except Exception as e:
        print(f"Error loading Dueling DQN: {e}")
        return None
    
    # Evaluation function
    def evaluate_agent(agent, num_episodes, num_runs):
        all_rewards = []
        run_stats = []
        
        for run in range(num_runs):
            env = gym.make('Pendulum-v0')
            run_rewards = []
            
            for ep in range(num_episodes):
                state = env.reset()
                if isinstance(state, tuple):
                    state = state[0]
                state = np.array(state, dtype=np.float32).flatten()[:3]
                
                total_reward = 0
                done = False
                
                while not done:
                    action = agent.select_action(state)
                    torque = action_index_to_torque(action, N_ACTIONS)
                    next_state, reward, done, _ = env.step([torque])
                    
                    if isinstance(next_state, tuple):
                        next_state = next_state[0]
                    next_state = np.array(next_state, dtype=np.float32).flatten()[:3]
                    
                    total_reward += reward
                    state = next_state
                
                run_rewards.append(total_reward)
            
            env.close()
            run_mean = np.mean(run_rewards)
            run_std = np.std(run_rewards)
            run_stats.append({'mean': run_mean, 'std': run_std})
            all_rewards.extend(run_rewards)
        
        # Calculate overall statistics
        overall_mean = np.mean(all_rewards)
        overall_std = np.std(all_rewards)
        run_consistency = np.std([r['mean'] for r in run_stats])
        
        return {
            'all_rewards': all_rewards,
            'run_stats': run_stats,
            'overall_mean': overall_mean,
            'overall_std': overall_std,
            'run_consistency': run_consistency
        }
    
    # Evaluate both models
    print("\nEvaluating Double DQN...")
    double_results = evaluate_agent(double_agent, num_episodes, num_runs)
    
    print("\nEvaluating Dueling DQN...")
    dueling_results = evaluate_agent(dueling_agent, num_episodes, num_runs)
    
    # Print comparison table
    print("\n=== COMPARISON RESULTS ===")
    print(f"{'Metric':<25} | {'Double DQN':<15} | {'Dueling DQN':<15} | {'Improvement':<15}")
    print("-"*70)
    print(f"{'Mean Reward':<25} | {double_results['overall_mean']:15.2f} | {dueling_results['overall_mean']:15.2f} | {dueling_results['overall_mean'] - double_results['overall_mean']:+.2f}")
    print(f"{'Reward Std':<25} | {double_results['overall_std']:15.2f} | {dueling_results['overall_std']:15.2f} | {double_results['overall_std'] - dueling_results['overall_std']:+.2f}")
    print(f"{'Run Consistency':<25} | {double_results['run_consistency']:15.2f} | {dueling_results['run_consistency']:15.2f} | {double_results['run_consistency'] - dueling_results['run_consistency']:+.2f}")
    
    # Calculate confidence intervals
    def calculate_ci(rewards):
        sem = np.std(rewards) / np.sqrt(len(rewards))
        return 1.96 * sem
    
    double_ci = calculate_ci(double_results['all_rewards'])
    dueling_ci = calculate_ci(dueling_results['all_rewards'])
    
    print("\n=== CONFIDENCE INTERVALS ===")
    print(f"Double DQN: {double_results['overall_mean']:.2f} ± {double_ci:.2f}")
    print(f"Dueling DQN: {dueling_results['overall_mean']:.2f} ± {dueling_ci:.2f}")
    
    # Plot reward distributions
    plt.figure(figsize=(12, 6))
    
    plt.subplot(1, 2, 1)
    plt.hist(double_results['all_rewards'], bins=20, color='blue', alpha=0.7)
    plt.axvline(double_results['overall_mean'], color='r', linestyle='dashed')
    plt.title(f'Double DQN (n={len(double_results["all_rewards"])})')
    plt.xlabel('Total Reward')
    plt.ylabel('Frequency')
    
    plt.subplot(1, 2, 2)
    plt.hist(dueling_results['all_rewards'], bins=20, color='green', alpha=0.7)
    plt.axvline(dueling_results['overall_mean'], color='r', linestyle='dashed')
    plt.title(f'Dueling DQN (n={len(dueling_results["all_rewards"])})')
    plt.xlabel('Total Reward')
    plt.ylabel('Frequency')
    
    plt.tight_layout()
    plt.savefig('model_comparison_distributions.png')
    plt.show()
    
    # Plot run-by-run comparison
    run_means = {
        'Double DQN': [r['mean'] for r in double_results['run_stats']],
        'Dueling DQN': [r['mean'] for r in dueling_results['run_stats']]
    }
    
    plt.figure(figsize=(10, 6))
    x = np.arange(num_runs)
    width = 0.35
    
    plt.bar(x - width/2, run_means['Double DQN'], width, label='Double DQN', color='blue')
    plt.bar(x + width/2, run_means['Dueling DQN'], width, label='Dueling DQN', color='green')
    
    plt.xlabel('Run')
    plt.ylabel('Mean Reward')
    plt.title('Run-by-Run Comparison')
    plt.xticks(x, [f'Run {i+1}' for i in range(num_runs)])
    plt.legend()
    plt.grid(True, axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig('run_comparison.png')
    plt.show()
    
    return {
        'double_dqn': double_results,
        'dueling_dqn': dueling_results,
        'improvement': dueling_results['overall_mean'] - double_results['overall_mean']
    }

if __name__ == "__main__":
    # Compare the models
    comparison_results = compare_models(
        double_dqn_path="double_dqn_best_weights.h5",
        dueling_dqn_path="dueling_dqn_best_weights.h5",
        num_episodes=50,
        num_runs=5
    )
    
    if comparison_results:
        improvement = comparison_results['improvement']
        print(f"\nOverall Improvement: {improvement:.2f}")
        if improvement > 0:
            print("Dueling DQN performs better than Double DQN")
        else:
            print("Double DQN performs better than Dueling DQN")
=== MODEL COMPARISON ===
Double DQN: double_dqn_best_weights.h5
Dueling DQN: dueling_dqn_best_weights.h5
Evaluation episodes: 5x50
==================================================
Loaded Double DQN weights from double_dqn_best_weights.h5
Loaded Dueling DQN weights from dueling_dqn_best_weights.h5

Evaluating Double DQN...

Evaluating Dueling DQN...

=== COMPARISON RESULTS ===
Metric                    | Double DQN      | Dueling DQN     | Improvement    
----------------------------------------------------------------------
Mean Reward               |         -170.83 |         -141.43 | +29.40
Reward Std                |           97.09 |           87.02 | +10.07
Run Consistency           |           14.05 |           12.50 | +1.55

=== CONFIDENCE INTERVALS ===
Double DQN: -170.83 ± 12.04
Dueling DQN: -141.43 ± 10.79
No description has been provided for this image
No description has been provided for this image
Overall Improvement: 29.40
Dueling DQN performs better than Double DQN

Observations and Analysis

  • The Dueling DQN has a much higher mean reward of -141.43 compared to the Double DQN's -170.83. This difference of +29.40 indicates that the Dueling architecture helped the agent learn a more effective policy that achieves better outcomes on average. The Reward Standard Deviation (Std) for the Dueling DQN is also lower (87.02 vs. 97.09), showing that within each evaluation run, the rewards are less spread out. This points to a more reliable policy.

  • The Dueling DQN also demonstrates superior stability. Its Run Consistency score of 12.50 is lower than the Double DQN's 14.05. A lower score here means the model's performance is more consistent across different evaluation runs, making its behavior more predictable.

  • The confidence intervals confirm these findings. The 95% confidence interval for Dueling DQN is -141.43 ± 10.7, which is entirely outside the interval for Double DQN, -170.83 ± 12.04. This means there is a high degree of statistical certainty that the Dueling DQN is a genuinely better model, not just a result of random chance.

Conclusion

  • The final Dueling DQN model is a clear success. It has overcome the performance and stability issues of the previous models. The Dueling architecture, when combined with the Double DQN update rule, has created a more powerful agent that is both high-performing and reliable. The systematic approach of first addressing instability (with Double DQN) and then improving the learning architecture (with Dueling DQN) has paid off, leading to a robust final model.

Improving model architecture further¶

Key Improvements for Final Dueling DQN Training

  1. Network Architecture

Increase capacity: Use 3 hidden layers: (256, 128, 64) with ReLU activations. This is the single most likely change to improve performance without risking instability.

  1. Training Regimen

Episodes: Train for 1200 episodes (as you requested). Save weights: Save at every 200 episodes (e.g., dueling_dqn_ep0200_weights.h5, ...), and always save final weights. Always save best-so-far weights as before.

  1. Hyperparameters

Learning rate: Lower to 1e-4 for greater stability with larger network. Other params: Keep your proven settings: batch size 64, gamma 0.995, epsilon decay, plateau_restart, etc.

  1. Evaluation

After training: Evaluate final weights, best-so-far weights, and all 200-episode checkpoints with the same robust evaluate_stability and enhanced_evaluate_stability used previously.

  1. Logging

Metrics: Continue logging all metrics for later analysis.

In [95]:
# Set seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

def action_index_to_torque(action_index, n_actions):
    """Convert action index to torque value"""
    return -2.0 + (action_index * 4.0) / (n_actions - 1)

class DuelingDQNAgent:
    def __init__(self, input_shape, n_actions, gamma, replay_memory_size, min_replay_memory,
                 batch_size, target_update_every, learning_rate, epsilon_start, epsilon_min, 
                 epsilon_decay, epsilon_strategy="plateau_restart"):
        self.input_shape = input_shape
        self.n_actions = n_actions
        self.gamma = gamma
        self.replay_memory_size = replay_memory_size
        self.min_replay_memory = min_replay_memory
        self.batch_size = batch_size
        self.target_update_every = target_update_every
        self.learning_rate = learning_rate
        self.epsilon = epsilon_start
        self.epsilon_start = epsilon_start
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        self.epsilon_strategy = epsilon_strategy
        
        self.memory = deque(maxlen=replay_memory_size)
        self.target_update_counter = 0
        self.performance_history = deque(maxlen=50)
        self.last_improvement_episode = 0
        self.plateau_threshold = 20
        self.q_values_history = deque(maxlen=1000)
        
        self.main_network = self._build_network()
        self.target_network = self._build_network()
        self.update_target()
        self.optimizer = Adam(learning_rate=learning_rate)
    
    def _build_network(self):
        inputs = Input(shape=(self.input_shape,))
        x = Dense(256, activation='relu')(inputs)
        x = Dense(128, activation='relu')(x)
        x = Dense(64, activation='relu')(x)
        value_stream = Dense(32, activation='relu')(x)
        value = Dense(1, activation='linear')(value_stream)
        advantage_stream = Dense(32, activation='relu')(x)
        advantage = Dense(self.n_actions, activation='linear')(advantage_stream)
        outputs = value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))
        return Model(inputs=inputs, outputs=outputs)
    
    def select_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(0, self.n_actions)
        q_values = self.main_network(state.reshape(1, -1))
        return np.argmax(q_values[0])
    
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
    
    def train_step(self):
        if len(self.memory) < self.min_replay_memory:
            return
        batch = random.sample(self.memory, self.batch_size)
        states = np.array([transition[0] for transition in batch])
        actions = np.array([transition[1] for transition in batch])
        rewards = np.array([transition[2] for transition in batch])
        next_states = np.array([transition[3] for transition in batch])
        dones = np.array([transition[4] for transition in batch])
        next_q_values_main = self.main_network(next_states)
        best_actions = tf.argmax(next_q_values_main, axis=1)
        next_q_values_target = self.target_network(next_states)
        batch_indices = tf.range(self.batch_size, dtype=tf.int32)
        indices = tf.stack([batch_indices, tf.cast(best_actions, tf.int32)], axis=1)
        max_target_q_values = tf.gather_nd(next_q_values_target, indices)
        targets = rewards + (self.gamma * max_target_q_values * (1 - dones))
        with tf.GradientTape() as tape:
            q_values = self.main_network(states, training=True)
            q_values_for_actions = tf.reduce_sum(q_values * tf.one_hot(actions, self.n_actions), axis=1)
            loss = tf.reduce_mean(tf.square(targets - q_values_for_actions))
        gradients = tape.gradient(loss, self.main_network.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.main_network.trainable_variables))
        self.q_values_history.append(float(tf.reduce_mean(q_values)))
    
    def update_target(self):
        self.target_network.set_weights(self.main_network.get_weights())
    
    def adaptive_epsilon_decay(self, episode, recent_performance):
        if self.epsilon_strategy == "linear":
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        elif self.epsilon_strategy == "performance_based":
            self.performance_history.append(recent_performance)
            if len(self.performance_history) >= 20:
                recent_avg = np.mean(list(self.performance_history)[-10:])
                older_avg = np.mean(list(self.performance_history)[-20:-10])
                if recent_avg > older_avg + 5:
                    decay_rate = 0.998
                    self.last_improvement_episode = episode
                else:
                    decay_rate = 0.992
                return max(self.epsilon_min, self.epsilon * decay_rate)
            else:
                return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        elif self.epsilon_strategy == "plateau_restart":
            self.performance_history.append(recent_performance)
            if len(self.performance_history) >= 20:
                recent_avg = np.mean(list(self.performance_history)[-10:])
                older_avg = np.mean(list(self.performance_history)[-20:-10])
                if recent_avg > older_avg + 5:
                    self.last_improvement_episode = episode
                episodes_since_improvement = episode - self.last_improvement_episode
                if episodes_since_improvement >= self.plateau_threshold:
                    print(f"Epsilon restart at episode {episode}: {self.epsilon:.3f} → {self.epsilon_start * 0.3:.3f}")
                    self.epsilon = self.epsilon_start * 0.3
                    self.last_improvement_episode = episode
                    return self.epsilon
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        elif self.epsilon_strategy == "high_exploration":
            epsilon_min_high = 0.15
            return max(epsilon_min_high, self.epsilon * 0.9995)
        else:
            return max(self.epsilon_min, self.epsilon * self.epsilon_decay)
    
    def decay_epsilon_advanced(self, episode, recent_performance):
        self.epsilon = self.adaptive_epsilon_decay(episode, recent_performance)
    
    def save(self, filepath):
        self.main_network.save_weights(filepath)
    
    def load(self, filepath):
        self.main_network.load_weights(filepath)
        self.update_target()
    
    def summary(self):
        self.main_network.summary()
In [96]:
def train_dueling_dqn_with_metrics(episodes=1200, save_prefix="dueling_dqn"):
    ENV_NAME = 'Pendulum-v0'
    INPUT_SHAPE = 3
    N_ACTIONS = 21
    MAX_STEPS = 200
    REPLAY_MEMORY_SIZE = 100000
    MIN_REPLAY_MEMORY = 2000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 1e-4
    GAMMA = 0.995
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"
    print(f"\n=== Training Dueling DQN ({episodes} episodes, deeper net, lr={LEARNING_RATE}) ===")
    print(f"Random seed: {SEED}")
    env = gym.make(ENV_NAME)
    try:
        env.reset(seed=SEED)
    except TypeError:
        pass
    agent = DuelingDQNAgent(
        INPUT_SHAPE, N_ACTIONS, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    metrics = {
        'episode': [],
        'reward': [],
        'avg10': [],
        'avg50': [],
        'epsilon': [],
        'memory': [],
        'q_values': [],
        'episode_steps': [],
        'episode_time': [],
        'since_improv': [],
    }
    best_avg_reward = -np.inf
    best_episode = 0
    total_training_steps = 0
    start_time = time.time()
    for ep in range(1, episodes + 1):
        ep_start = time.time()
        try:
            s = env.reset(seed=SEED + ep)
            if isinstance(s, tuple):
                s = s[0]
        except TypeError:
            env.seed(SEED + ep)
            s = env.reset()
        s = np.asarray(s, dtype=np.float32).flatten()[:3]
        total_reward = 0
        episode_training_steps = 0
        for t in range(MAX_STEPS):
            a_idx = agent.select_action(s)
            torque = action_index_to_torque(a_idx, N_ACTIONS)
            s_next, r, done, *info = env.step([torque])
            s_next = s_next[0] if isinstance(s_next, tuple) else s_next
            s_next = np.asarray(s_next, dtype=np.float32).flatten()[:3]
            agent.remember(s, a_idx, r, s_next, done)
            if len(agent.memory) >= MIN_REPLAY_MEMORY:
                agent.train_step()
                total_training_steps += 1
                episode_training_steps += 1
            s = s_next
            total_reward += r
            if done:
                break
        agent.performance_history.append(total_reward)
        recent_performance = np.mean(list(agent.performance_history)[-10:]) if len(agent.performance_history) >= 10 else total_reward
        agent.decay_epsilon_advanced(ep, recent_performance)
        if ep % TARGET_UPDATE_EVERY == 0:
            agent.update_target()
        avg10 = np.mean(metrics['reward'][-9:] + [total_reward]) if len(metrics['reward']) >= 9 else np.mean([total_reward])
        avg50 = np.mean(metrics['reward'][-49:] + [total_reward]) if len(metrics['reward']) >= 49 else np.mean([total_reward])
        avg_q = np.mean(list(agent.q_values_history)[-100:]) if agent.q_values_history else 0
        since_improv = ep - best_episode
        metrics['episode'].append(ep)
        metrics['reward'].append(total_reward)
        metrics['avg10'].append(avg10)
        metrics['avg50'].append(avg50)
        metrics['epsilon'].append(agent.epsilon)
        metrics['memory'].append(len(agent.memory))
        metrics['q_values'].append(avg_q)
        metrics['episode_steps'].append(episode_training_steps)
        metrics['episode_time'].append(time.time() - ep_start)
        metrics['since_improv'].append(since_improv)
        if avg10 > best_avg_reward:
            best_avg_reward = avg10
            best_episode = ep
            agent.save(f"{save_prefix}_best_weights.h5")
        if ep % 200 == 0 or ep == episodes:
            agent.save(f"{save_prefix}_ep{ep:04d}_weights.h5")
        if ep <= 600:
            if ep == 1 or avg10 > (metrics.get('best_avg_reward_600', -np.inf)):
                agent.save(f"{save_prefix}_600ep_best_weights.h5")
                metrics['best_avg_reward_600'] = avg10
                metrics['best_episode_600'] = ep
        if ep <= 10 or ep % 50 == 0 or ep in [100, 200, 300, 400, 500, 600, 800, 1000, 1200]:
            print(f"Episode {ep:4d} | Reward: {total_reward:7.2f} | Avg(10): {avg10:7.2f} | "
                  f"ε: {agent.epsilon:.3f} | Memory: {len(agent.memory):,} ({len(agent.memory)/REPLAY_MEMORY_SIZE:.1%}) | "
                  f"Steps: {episode_training_steps} | Time: {metrics['episode_time'][-1]:.2f}s | "
                  f"Avg Q-val: {avg_q:.2f} | Since Improv: {since_improv}")
    env.close()
    training_time = time.time() - start_time
    pd.DataFrame(metrics).to_csv(f"{save_prefix}_metrics.csv", index=False)
    print(f"Training complete. Best avg(10): {best_avg_reward:.2f} at episode {best_episode}")
    print(f"Final weights saved as: {save_prefix}_ep{episodes:04d}_weights.h5")
    return {
        'agent': agent,
        'metrics': metrics,
        'total_time': training_time,
        'best_avg_reward': best_avg_reward,
        'best_episode': best_episode,
        'total_training_steps': total_training_steps,
    }
In [97]:
def evaluate_stability(weights_path, num_episodes=50, num_runs=5):
    INPUT_SHAPE = 3
    N_ACTIONS = 21
    MAX_STEPS = 200
    REPLAY_MEMORY_SIZE = 100000
    MIN_REPLAY_MEMORY = 2000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 1e-4
    GAMMA = 0.995
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"
    agent = DuelingDQNAgent(
        INPUT_SHAPE, N_ACTIONS, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY, 
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START, 
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    try:
        agent.load(weights_path)
        agent.epsilon = 0.0
    except Exception as e:
        print(f"ERROR loading weights: {e}")
        return None
    all_run_results = []
    all_rewards = []
    for run in range(num_runs):
        env = gym.make('Pendulum-v0')
        run_rewards = []
        for ep in range(num_episodes):
            state = env.reset()
            if isinstance(state, tuple):
                state = state[0]
            state = np.array(state, dtype=np.float32)
            if state.shape != (3,):
                state = state.flatten()[:3]
            total_reward = 0
            for t in range(MAX_STEPS):
                a_idx = agent.select_action(state)
                torque = action_index_to_torque(a_idx, N_ACTIONS)
                next_state, reward, done, *_ = env.step([torque])
                if isinstance(next_state, tuple):
                    next_state = next_state[0]
                next_state = np.array(next_state, dtype=np.float32)
                if next_state.shape != (3,):
                    next_state = next_state.flatten()[:3]
                total_reward += reward
                state = next_state
                if done:
                    break
            run_rewards.append(total_reward)
        env.close()
        run_mean = np.mean(run_rewards)
        run_std = np.std(run_rewards)
        all_run_results.append({'mean': run_mean, 'std': run_std, 'rewards': run_rewards})
        all_rewards.extend(run_rewards)
    all_means = [run['mean'] for run in all_run_results]
    overall_mean = np.mean(all_rewards)
    overall_std = np.std(all_rewards)
    run_consistency = np.std(all_means)
    return {
        'mean': overall_mean,
        'std': overall_std,
        'run_consistency': run_consistency,
        'all_rewards': all_rewards,
        'num_runs': num_runs,
        'num_episodes': num_episodes
    }

def enhanced_evaluate_stability(weights_path, num_episodes=50, num_runs=5):
    results = evaluate_stability(weights_path, num_episodes, num_runs)
    if not results:
        print(f"Could not evaluate {weights_path}")
        return None
    sem = results['std'] / np.sqrt(len(results['all_rewards']))
    ci_width = 1.96 * sem
    ci_95 = (results['mean'] - ci_width, results['mean'] + ci_width)
    print("\nEnhanced Evaluation Metrics:")
    print(f"95% Confidence Interval: {results['mean']:.2f} ± {ci_width:.2f}")
    print(f"Reward Range: {np.min(results['all_rewards']):.2f} to {np.max(results['all_rewards']):.2f}")
    plt.figure(figsize=(10, 5))
    plt.hist(results['all_rewards'], bins=20, color='blue', alpha=0.7)
    plt.axvline(results['mean'], color='r', linestyle='dashed', linewidth=1)
    plt.title(f"Reward Distribution (n={len(results['all_rewards'])})")
    plt.xlabel('Total Reward')
    plt.ylabel('Frequency')
    plt.savefig(f'{weights_path}_reward_dist.png')
    plt.show()
    results['ci_95'] = ci_95
    return results
In [98]:
def evaluate_all_checkpoints(
    checkpoint_prefix="dueling_dqn",
    episodes=1200,
    interval=200,
    extra_checkpoints=["dueling_dqn_best_weights.h5", "dueling_dqn_600ep_best_weights.h5"],
    eval_episodes=50,
    eval_runs=5
):
    checkpoints = [f"{checkpoint_prefix}_ep{ep:04d}_weights.h5" for ep in range(interval, episodes+1, interval)]
    checkpoints += extra_checkpoints
    checkpoints = [c for c in checkpoints if os.path.exists(c)]
    results_table = []
    all_results = {}
    print("\n=== Evaluation of all checkpoints ===")
    for ckpt in checkpoints:
        print(f"\nEvaluating: {ckpt}")
        result = enhanced_evaluate_stability(ckpt, num_episodes=eval_episodes, num_runs=eval_runs)
        if result is not None:
            mean = result['mean']
            std = result['std']
            ci_low, ci_high = result['ci_95']
            reward_min = np.min(result['all_rewards'])
            reward_max = np.max(result['all_rewards'])
            results_table.append([ckpt, mean, std, ci_low, ci_high, reward_min, reward_max])
            all_results[ckpt] = result
        else:
            results_table.append([ckpt, None, None, None, None, None, None])
    print("\nSummary of Evaluation (sorted by mean reward):")
    print(f"{'Checkpoint':<35} | {'Mean':>8} | {'Std':>8} | {'CI95-':>8} | {'CI95+':>8} | {'Min':>8} | {'Max':>8}")
    print("-"*90)
    for row in sorted(results_table, key=lambda x: x[1] if x[1] is not None else -np.inf, reverse=True):
        printable_row = [os.path.basename(row[0])]
        for val in row[1:]:
            if val is None:
                printable_row.append("   N/A   ")
            else:
                printable_row.append(f"{val:8.2f}")
        print(" | ".join([f"{v:<35}" if i == 0 else v for i, v in enumerate(printable_row)]))
    labeled_ckpts = [os.path.basename(r[0]) for r in results_table]
    mean_rewards = [r[1] for r in results_table]
    plt.figure(figsize=(12,5))
    plt.plot(labeled_ckpts, mean_rewards, marker='o')
    plt.title("Mean Evaluation Reward vs. Checkpoint")
    plt.ylabel("Mean Reward")
    plt.xlabel("Checkpoint")
    plt.xticks(rotation=30)
    plt.tight_layout()
    plt.savefig("dueling_dqn_all_checkpoints_eval.png")
    plt.show()
    return all_results
In [99]:
if __name__ == "__main__":
    train_dueling_dqn_with_metrics(episodes=1200, save_prefix="dueling_dqn")
    evaluate_all_checkpoints(
        checkpoint_prefix="dueling_dqn",
        episodes=1200,
        interval=200,
        extra_checkpoints=["dueling_dqn_best_weights.h5", "dueling_dqn_600ep_best_weights.h5"],
        eval_episodes=50,
        eval_runs=5
    )
=== Training Dueling DQN (1200 episodes, deeper net, lr=0.0001) ===
Random seed: 42
Episode    1 | Reward: -1330.80 | Avg(10): -1330.80 | ε: 0.995 | Memory: 200 (0.2%) | Steps: 0 | Time: 0.02s | Avg Q-val: 0.00 | Since Improv: 1
Episode    2 | Reward: -973.97 | Avg(10): -973.97 | ε: 0.990 | Memory: 400 (0.4%) | Steps: 0 | Time: 0.05s | Avg Q-val: 0.00 | Since Improv: 1
Episode    3 | Reward: -1701.14 | Avg(10): -1701.14 | ε: 0.985 | Memory: 600 (0.6%) | Steps: 0 | Time: 0.01s | Avg Q-val: 0.00 | Since Improv: 1
Episode    4 | Reward: -882.50 | Avg(10): -882.50 | ε: 0.980 | Memory: 800 (0.8%) | Steps: 0 | Time: 0.03s | Avg Q-val: 0.00 | Since Improv: 2
Episode    5 | Reward: -995.21 | Avg(10): -995.21 | ε: 0.975 | Memory: 1,000 (1.0%) | Steps: 0 | Time: 0.12s | Avg Q-val: 0.00 | Since Improv: 1
Episode    6 | Reward: -1229.71 | Avg(10): -1229.71 | ε: 0.970 | Memory: 1,200 (1.2%) | Steps: 0 | Time: 0.04s | Avg Q-val: 0.00 | Since Improv: 2
Episode    7 | Reward: -1744.80 | Avg(10): -1744.80 | ε: 0.966 | Memory: 1,400 (1.4%) | Steps: 0 | Time: 0.08s | Avg Q-val: 0.00 | Since Improv: 3
Episode    8 | Reward: -1292.61 | Avg(10): -1292.61 | ε: 0.961 | Memory: 1,600 (1.6%) | Steps: 0 | Time: 0.05s | Avg Q-val: 0.00 | Since Improv: 4
Episode    9 | Reward: -1459.42 | Avg(10): -1459.42 | ε: 0.956 | Memory: 1,800 (1.8%) | Steps: 0 | Time: 0.09s | Avg Q-val: 0.00 | Since Improv: 5
Episode   10 | Reward: -1534.12 | Avg(10): -1314.43 | ε: 0.951 | Memory: 2,000 (2.0%) | Steps: 1 | Time: 0.30s | Avg Q-val: 0.00 | Since Improv: 6
Episode   50 | Reward: -942.08 | Avg(10): -1191.61 | ε: 0.778 | Memory: 10,000 (10.0%) | Steps: 200 | Time: 19.12s | Avg Q-val: -46.45 | Since Improv: 46
Episode  100 | Reward: -729.50 | Avg(10): -675.39 | ε: 0.606 | Memory: 20,000 (20.0%) | Steps: 200 | Time: 19.39s | Avg Q-val: -71.28 | Since Improv: 2
Episode  150 | Reward: -254.23 | Avg(10): -407.64 | ε: 0.471 | Memory: 30,000 (30.0%) | Steps: 200 | Time: 18.73s | Avg Q-val: -66.22 | Since Improv: 5
Episode  200 | Reward: -239.60 | Avg(10): -241.97 | ε: 0.367 | Memory: 40,000 (40.0%) | Steps: 200 | Time: 19.16s | Avg Q-val: -44.79 | Since Improv: 1
Episode  250 | Reward: -250.69 | Avg(10): -317.74 | ε: 0.286 | Memory: 50,000 (50.0%) | Steps: 200 | Time: 19.31s | Avg Q-val: -25.08 | Since Improv: 48
Episode  300 | Reward: -612.66 | Avg(10): -381.13 | ε: 0.222 | Memory: 60,000 (60.0%) | Steps: 200 | Time: 18.99s | Avg Q-val: -10.09 | Since Improv: 98
Episode  350 | Reward: -503.58 | Avg(10): -551.96 | ε: 0.173 | Memory: 70,000 (70.0%) | Steps: 200 | Time: 20.25s | Avg Q-val: -4.71 | Since Improv: 148
Episode  400 | Reward: -295.13 | Avg(10): -381.44 | ε: 0.135 | Memory: 80,000 (80.0%) | Steps: 200 | Time: 20.83s | Avg Q-val: -5.33 | Since Improv: 198
Episode  450 | Reward: -389.75 | Avg(10): -347.74 | ε: 0.105 | Memory: 90,000 (90.0%) | Steps: 200 | Time: 20.52s | Avg Q-val: -7.47 | Since Improv: 248
Episode  500 | Reward: -259.17 | Avg(10): -305.19 | ε: 0.082 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 23.89s | Avg Q-val: -9.00 | Since Improv: 298
Episode  550 | Reward: -361.35 | Avg(10): -327.68 | ε: 0.063 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 24.65s | Avg Q-val: 0.97 | Since Improv: 348
Episode  600 | Reward:  -15.58 | Avg(10): -219.14 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 72.10s | Avg Q-val: 6.91 | Since Improv: 398
Episode  650 | Reward: -261.79 | Avg(10): -209.15 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 17.56s | Avg Q-val: 6.47 | Since Improv: 7
Episode  700 | Reward: -130.70 | Avg(10): -276.43 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 16.28s | Avg Q-val: 7.71 | Since Improv: 57
Episode  750 | Reward: -123.51 | Avg(10): -184.86 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 15.01s | Avg Q-val: 13.69 | Since Improv: 8
Episode  800 | Reward: -337.78 | Avg(10): -191.43 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 16.88s | Avg Q-val: 23.04 | Since Improv: 4
Episode  850 | Reward:  -15.75 | Avg(10): -233.41 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 18.31s | Avg Q-val: 30.31 | Since Improv: 40
Episode  900 | Reward: -126.47 | Avg(10): -212.45 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 17.27s | Avg Q-val: 33.96 | Since Improv: 90
Episode  950 | Reward: -127.92 | Avg(10): -240.05 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 15.92s | Avg Q-val: 32.79 | Since Improv: 140
Episode 1000 | Reward: -262.57 | Avg(10): -314.10 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 15.82s | Avg Q-val: 31.14 | Since Improv: 190
Episode 1050 | Reward: -380.26 | Avg(10): -322.56 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 14.38s | Avg Q-val: 28.45 | Since Improv: 240
Episode 1100 | Reward: -229.63 | Avg(10): -160.05 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 15.09s | Avg Q-val: 26.89 | Since Improv: 290
Episode 1150 | Reward: -256.65 | Avg(10): -269.39 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 14.44s | Avg Q-val: 24.69 | Since Improv: 340
Episode 1200 | Reward: -262.38 | Avg(10): -252.99 | ε: 0.050 | Memory: 100,000 (100.0%) | Steps: 200 | Time: 14.58s | Avg Q-val: 21.81 | Since Improv: 390
Training complete. Best avg(10): -105.61 at episode 810
Final weights saved as: dueling_dqn_ep1200_weights.h5

=== Evaluation of all checkpoints ===

Evaluating: dueling_dqn_ep0200_weights.h5

Enhanced Evaluation Metrics:
95% Confidence Interval: -148.82 ± 10.44
Reward Range: -389.08 to -0.36
No description has been provided for this image
Evaluating: dueling_dqn_ep0400_weights.h5

Enhanced Evaluation Metrics:
95% Confidence Interval: -166.75 ± 11.09
Reward Range: -400.28 to -14.13
No description has been provided for this image
Evaluating: dueling_dqn_ep0600_weights.h5

Enhanced Evaluation Metrics:
95% Confidence Interval: -159.66 ± 10.37
Reward Range: -386.77 to -11.36
No description has been provided for this image
Evaluating: dueling_dqn_ep0800_weights.h5

Enhanced Evaluation Metrics:
95% Confidence Interval: -145.44 ± 10.39
Reward Range: -366.37 to -1.02
No description has been provided for this image
Evaluating: dueling_dqn_ep1000_weights.h5

Enhanced Evaluation Metrics:
95% Confidence Interval: -157.46 ± 10.21
Reward Range: -377.70 to -4.40
No description has been provided for this image
Evaluating: dueling_dqn_ep1200_weights.h5

Enhanced Evaluation Metrics:
95% Confidence Interval: -164.29 ± 10.27
Reward Range: -384.94 to -18.22
No description has been provided for this image
Evaluating: dueling_dqn_best_weights.h5

Enhanced Evaluation Metrics:
95% Confidence Interval: -149.40 ± 10.55
Reward Range: -373.07 to -0.79
No description has been provided for this image
Evaluating: dueling_dqn_600ep_best_weights.h5

Enhanced Evaluation Metrics:
95% Confidence Interval: -143.22 ± 9.96
Reward Range: -369.55 to -0.70
No description has been provided for this image
Summary of Evaluation (sorted by mean reward):
Checkpoint                          |     Mean |      Std |    CI95- |    CI95+ |      Min |      Max
------------------------------------------------------------------------------------------
dueling_dqn_600ep_best_weights.h5   |  -143.22 |    80.36 |  -153.18 |  -133.26 |  -369.55 |    -0.70
dueling_dqn_ep0800_weights.h5       |  -145.44 |    83.85 |  -155.84 |  -135.05 |  -366.37 |    -1.02
dueling_dqn_ep0200_weights.h5       |  -148.82 |    84.18 |  -159.25 |  -138.38 |  -389.08 |    -0.36
dueling_dqn_best_weights.h5         |  -149.40 |    85.10 |  -159.94 |  -138.85 |  -373.07 |    -0.79
dueling_dqn_ep1000_weights.h5       |  -157.46 |    82.39 |  -167.68 |  -147.25 |  -377.70 |    -4.40
dueling_dqn_ep0600_weights.h5       |  -159.66 |    83.67 |  -170.03 |  -149.29 |  -386.77 |   -11.36
dueling_dqn_ep1200_weights.h5       |  -164.29 |    82.84 |  -174.56 |  -154.03 |  -384.94 |   -18.22
dueling_dqn_ep0400_weights.h5       |  -166.75 |    89.47 |  -177.84 |  -155.66 |  -400.28 |   -14.13
No description has been provided for this image

Observations and Analysis

  1. Performance Regression:
  • The new model's best evaluation performance (dueling_dqn_best_weights.h5) has a mean reward of -149.40.
  • The final model from the previous experiment (which we identified as a Dueling Double DQN) had a mean reward of -141.43.
  • This represents a performance decrease of about 8 points, which is significant.

  1. Increased Stability (but not enough to compensate for the performance drop):
  • The dueling_dqn_best_weights.h5 from this new run has a narrower 95% confidence interval and a smaller reward range.
  • Previous model's best weights: CI (-152.0, -130.8)
  • New model's best weights: CI (-159.95, -138.85)

The new model is indeed slightly more consistent. However, the drop in its average performance makes this increased stability less valuable.


  1. Training Trajectory is Telling:
  • The training logs for the new model show a best average reward of -105.61 at episode 810. This is a much higher value than your previous Dueling Double DQN's best training average. Yet, this high training reward did not translate to a better evaluation score.

Why this happened?

There are a few key reasons why a larger network might not perform as well, even with a lower learning rate:

  • Overfitting: A deeper network has more parameters and thus more capacity to memorize specific training examples rather than learning generalizable policies. The new model's high training reward that doesn't translate to a high evaluation reward is a classic sign of overfitting. It got good at the training episodes, but its policy doesn't work as well in the unseen evaluation episodes.

  • Learning Rate Mismatch: While a lower learning rate (1e-4) is often good for stability in deep networks, it might not be the optimal rate for this specific problem. The agent may not have learned efficiently enough to fully leverage the new network's capacity.

  • Plateauing Issues: Despite the longer training duration, the agent's performance peaked around episode 810 and then degraded. This suggests that the model became stuck in a local optimum or began to overfit after that point. The extra training episodes were unproductive.

Overall conclusion¶

  • The "Dueling Double DQN" from your previous experiment (which had a simpler network) was the superior model.

  • This final experiment provides a crucial lesson: Simply increasing network capacity and training duration does not guarantee better performance in reinforcement learning. In fact, it can lead to overfitting and a less effective policy.

  • The best approach is to find the right balance between model complexity, learning rate, and training duration for a given problem. My previous model found that balance better than this final, larger mode

BEST MODEL

In [102]:
import numpy as np
import tensorflow as tf
import gym
import matplotlib.pyplot as plt
from collections import deque

def final_evaluation(weights_path, num_episodes=100, num_runs=5):
    """Comprehensive evaluation of the best model with visualizations and statistics"""
    
    # Hyperparameters (must match training)
    INPUT_SHAPE = 3
    N_ACTIONS = 21
    GAMMA = 0.995
    REPLAY_MEMORY_SIZE = 100000
    MIN_REPLAY_MEMORY = 2000
    BATCH_SIZE = 64
    TARGET_UPDATE_EVERY = 5
    LEARNING_RATE = 3e-4
    EPSILON_START = 1.0
    EPSILON_MIN = 0.05
    EPSILON_DECAY = 0.995
    EPSILON_STRATEGY = "plateau_restart"
    
    print(f"\n=== FINAL EVALUATION OF BEST MODEL ===")
    print(f"Model: {weights_path}")
    print(f"Evaluation episodes: {num_runs}x{num_episodes}")
    print("="*60)
    
    # Initialize agent
    agent = DuelingDQNAgent(
        INPUT_SHAPE, N_ACTIONS, GAMMA, REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY,
        BATCH_SIZE, TARGET_UPDATE_EVERY, LEARNING_RATE, EPSILON_START,
        EPSILON_MIN, EPSILON_DECAY, epsilon_strategy=EPSILON_STRATEGY
    )
    
    # Load weights
    try:
        agent.load(weights_path)
        agent.epsilon = 0.0  # Pure exploitation
        print(f"Successfully loaded weights from {weights_path}")
    except Exception as e:
        print(f"Error loading weights: {e}")
        return None
    
    # Track detailed evaluation metrics
    all_rewards = []
    run_metrics = {
        'means': [],
        'stds': [],
        'mins': [],
        'maxs': [],
        'median': []
    }
    
    # Run evaluation
    for run in range(num_runs):
        env = gym.make('Pendulum-v0')
        run_rewards = []
        
        for ep in range(num_episodes):
            state = env.reset()
            if isinstance(state, tuple):
                state = state[0]
            state = np.array(state, dtype=np.float32).flatten()[:3]
            
            total_reward = 0
            done = False
            
            while not done:
                action = agent.select_action(state)
                torque = action_index_to_torque(action, N_ACTIONS)
                next_state, reward, done, _ = env.step([torque])
                
                if isinstance(next_state, tuple):
                    next_state = next_state[0]
                next_state = np.array(next_state, dtype=np.float32).flatten()[:3]
                
                total_reward += reward
                state = next_state
            
            run_rewards.append(total_reward)
        
        env.close()
        
        # Calculate run statistics
        run_mean = np.mean(run_rewards)
        run_std = np.std(run_rewards)
        run_min = np.min(run_rewards)
        run_max = np.max(run_rewards)
        run_median = np.median(run_rewards)
        
        run_metrics['means'].append(run_mean)
        run_metrics['stds'].append(run_std)
        run_metrics['mins'].append(run_min)
        run_metrics['maxs'].append(run_max)
        run_metrics['median'].append(run_median)
        
        all_rewards.extend(run_rewards)
        
        print(f"Run {run+1}/{num_runs}: Mean = {run_mean:.1f} ± {run_std:.1f} | "
              f"Range = [{run_min:.1f}, {run_max:.1f}] | Median = {run_median:.1f}")
    
    # Calculate overall statistics
    overall_mean = np.mean(all_rewards)
    overall_std = np.std(all_rewards)
    overall_min = np.min(all_rewards)
    overall_max = np.max(all_rewards)
    overall_median = np.median(all_rewards)
    run_consistency = np.std(run_metrics['means'])  # Std of run means
    
    # Confidence interval
    sem = overall_std / np.sqrt(len(all_rewards))
    ci_width = 1.96 * sem
    
    print("\n=== FINAL PERFORMANCE SUMMARY ===")
    print(f"Total episodes evaluated: {len(all_rewards)}")
    print(f"Overall mean reward: {overall_mean:.2f} ± {ci_width:.2f} (95% CI)")
    print(f"Reward std: {overall_std:.2f}")
    print(f"Reward range: [{overall_min:.2f}, {overall_max:.2f}]")
    print(f"Median reward: {overall_median:.2f}")
    print(f"Run-to-run consistency (std of means): {run_consistency:.2f}")
    
    # Plotting
    plt.figure(figsize=(15, 5))
    
    # Reward distribution
    plt.subplot(1, 3, 1)
    plt.hist(all_rewards, bins=20, color='green', alpha=0.7)
    plt.axvline(overall_mean, color='r', linestyle='dashed', linewidth=1)
    plt.title(f'Reward Distribution (n={len(all_rewards)})')
    plt.xlabel('Total Reward')
    plt.ylabel('Frequency')
    
    # Run-by-run performance
    plt.subplot(1, 3, 2)
    x = np.arange(num_runs)
    plt.bar(x, run_metrics['means'], yerr=run_metrics['stds'], 
            color='blue', alpha=0.7, capsize=5)
    plt.xticks(x, [f'Run {i+1}' for i in range(num_runs)])
    plt.title('Performance Across Runs')
    plt.xlabel('Run')
    plt.ylabel('Mean Reward')
    plt.grid(True, axis='y', linestyle='--', alpha=0.7)
    
    # Metrics comparison
    plt.subplot(1, 3, 3)
    metrics = ['Mean', 'Median', 'Min', 'Max']
    values = [overall_mean, overall_median, overall_min, overall_max]
    plt.bar(metrics, values, color=['blue', 'green', 'red', 'purple'])
    plt.title('Key Performance Metrics')
    plt.ylabel('Reward Value')
    for i, v in enumerate(values):
        plt.text(i, v, f"{v:.1f}", ha='center', va='bottom')
    plt.grid(True, axis='y', linestyle='--', alpha=0.7)
    
    plt.tight_layout()
    plt.savefig('final_model_performance.png')
    plt.show()
    
    return {
        'all_rewards': all_rewards,
        'overall_mean': overall_mean,
        'overall_std': overall_std,
        'ci_95': (overall_mean - ci_width, overall_mean + ci_width),
        'min': overall_min,
        'max': overall_max,
        'median': overall_median,
        'run_consistency': run_consistency,
        'run_metrics': run_metrics
    }
In [103]:
if __name__ == "__main__":
    # Evaluate the best model
    best_model_path = "dueling_dqn_best_weights.h5"
    evaluation_results = final_evaluation(best_model_path)
    
    if evaluation_results:
        print("\n=== FINAL VERDICT ===")
        print(f"Model {best_model_path} has been thoroughly evaluated with:")
        print(f"- {len(evaluation_results['all_rewards'])} total episodes")
        print(f"- Consistent performance across runs (σ = {evaluation_results['run_consistency']:.2f})")
        print(f"- 95% confidence interval: {evaluation_results['ci_95'][0]:.2f} to {evaluation_results['ci_95'][1]:.2f}")
        print("\nThis represents the best performance achieved in your experiments.")
=== FINAL EVALUATION OF BEST MODEL ===
Model: dueling_dqn_best_weights.h5
Evaluation episodes: 5x100
============================================================
Successfully loaded weights from dueling_dqn_best_weights.h5
Run 1/5: Mean = -151.3 ± 87.3 | Range = [-359.3, -0.9] | Median = -123.4
Run 2/5: Mean = -151.1 ± 86.0 | Range = [-361.6, -0.9] | Median = -122.6
Run 3/5: Mean = -147.2 ± 80.1 | Range = [-356.2, -0.8] | Median = -122.5
Run 4/5: Mean = -146.7 ± 81.5 | Range = [-360.2, -1.0] | Median = -123.2
Run 5/5: Mean = -140.8 ± 83.8 | Range = [-363.5, -0.8] | Median = -122.4

=== FINAL PERFORMANCE SUMMARY ===
Total episodes evaluated: 500
Overall mean reward: -147.42 ± 7.35 (95% CI)
Reward std: 83.88
Reward range: [-363.52, -0.79]
Median reward: -122.78
Run-to-run consistency (std of means): 3.80
No description has been provided for this image
=== FINAL VERDICT ===
Model dueling_dqn_best_weights.h5 has been thoroughly evaluated with:
- 500 total episodes
- Consistent performance across runs (σ = 3.80)
- 95% confidence interval: -154.78 to -140.07

This represents the best performance achieved in your experiments.

Learning Point: ¶

  • My initial hyperparameter tuning process, while extensive, had a significant flaw: it was a univariate analysis. I tuned each hyperparameter (e.g., n_actions, replay memory size, epsilon strategy) in isolation, assuming they were independent. I failed to systematically investigate their interactions, which is crucial in reinforcement learning where parameters are highly coupled. For example, the optimal learning rate is likely to be different for a larger network than for a smaller one.

  • A key takeaway was the importance of establishing a stable and robust baseline. My initial results were volatile, but by identifying and implementing the plateau_restart epsilon strategy and extending the training duration to 1000 episodes, I achieved a much more consistent model. This allowed me to confidently evaluate the impact of subsequent improvements, as I was comparing each new technique to a 'best effort' baseline rather than a noisy, unreliable one.

  • My progression was a methodical application of advanced RL techniques. I first addressed the issue of overestimation bias by implementing Double DQN, which successfully improved the stability of the agent's value estimates. I then focused on improving the agent's learning efficiency and generalization by adopting the Dueling DQN architecture. The combination of these two techniques (Dueling Double DQN) proved to be the most effective, as it balanced the stability of Double DQN with the performance gains of the Dueling architecture

  • My final experiment taught me that more is not always better. Increasing the network's capacity and training for a longer duration did not yield a better model. In fact, it led to a performance decrease and a strong indication of overfitting. The model became too complex and began to memorize specific training scenarios rather than learning a robust, generalizable policy. This highlights that model capacity must be carefully balanced with the complexity of the task and the amount of data available.


In this project, I did not focus on just one criterion but rather on a strategic balance between learning speed, stability, and final performance. My methodology evolved to reflect this, moving from an initial focus on raw performance to a more nuanced approach.

Initially, my goal was simply to achieve the highest possible reward, as is common in many reinforcement learning tasks. However, early experiments revealed that a model could achieve a high reward in one run and fail completely in another. This led to a significant shift in my criteria.

I concluded that most stable learning was the most critical objective. A stable agent is a predictable and reliable one. My research progression directly reflects this:

  • Establishing a Stable Baseline: I first focused on reducing the variance of my agent. Techniques like using a larger replay memory and an adaptive plateau_restart epsilon strategy were specifically chosen to create a more consistent and reliable learning process. This allowed me to form a solid foundation before pursuing higher performance.

  • Balancing Stability with Performance: My next step was to introduce the Double DQN algorithm, which directly tackles the problem of Q-value overestimation, a known source of instability. This successfully improved the agent's consistency (as seen by a lower run-to-run consistency score) but did not dramatically improve the mean reward. This confirmed that I had a stable but suboptimal agent.

  • Prioritizing Performance on a Stable Foundation: With stability secured, I then introduced the Dueling DQN architecture. This modification was aimed at improving the agent's learning efficiency and ultimately its performance. The results showed that this approach successfully increased the average reward significantly while maintaining a reasonable level of stability.

  • My final model, a Dueling Double DQN, represents the culmination of this strategy. It is not the fastest to train, but it achieves a superior balance of high performance and high stability, making it the most robust and reliable solution for this problem. The final experiment with a larger network showed a decrease in performance, reinforcing my conclusion that a simple increase in network size is not a substitute for a well-thought-out, balanced approach to stability and performance.

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: